Module 4: Confidence intervals

You’ll explore how data professionals use confidence intervals to describe the uncertainty of their estimates. You’ll learn how to construct and interpret confidence intervals — and how to avoid some common misinterpretations.

Learning Objectives

Use Python to construct a confidence interval
Describe how to construct a confidence interval for means and proportions
Identify common forms of misinterpretation associated with confidence intervals
Describe how to properly interpret a confidence interval
Define concepts related to confidence intervals such as confidence level and margin of error
Explain the difference between a point estimate and an interval estimate

Introduction to confidence intervals

Video: Welcome to module 4

Notes

Transcript

What are Confidence Intervals?

In statistics, a confidence interval provides a range of values around an estimate (e.g., a sample mean) to describe how uncertain that estimate might be.
They help quantify the potential variability we might expect if we draw multiple samples from the same population.

Why They Matter for Data Professionals

Uncertainty: Data analysis rarely offers absolute certainty. Confidence intervals help express the degree of uncertainty surrounding estimates.
Decision-Making: Can inform decisions about investments, customer programs, website features, etc.
Job Interviews: It’s a core concept that may come up in data-related interviews.

Concepts to Understand

Misinterpretations: Confidence intervals are often misunderstood, so it’s important to learn how to interpret them correctly.
Construction: You’ll learn the steps to calculate confidence intervals, including choosing a confidence level and calculating the margin of error.
Means vs. Proportions: Confidence intervals can be constructed for different types of estimates (average values, proportions of a population, etc.)
Python Tools: You’ll see how to use the SciPy library to calculate confidence intervals.

Welcome back. Wow, you’ve come so far on your learning journey
and you’ve picked up a lot of new stats
knowledge along the way. So far in this course, you’ve learned how
data professionals use descriptive statistics
to summarize and explore their data, and how they use
inferential statistics to draw conclusions
about their data. You’re familiar with basic
rules of probability, like the addition and
multiplication rules, and how they describe the
likelihood of random events. You also know how probability distributions,
like the binomial, Poisson, and normal
distributions can help you model
different types of data. Recently, you also learned
about the main stages of the sampling process
and the benefits and drawbacks of using
different sampling methods. Finally, you’ve learned
how data professionals use sampling distributions to
estimate means and proportions. In this part of the course, we’ll explore how
to construct and interpret a confidence interval. A confidence interval is
a range of values that describes the uncertainty
surrounding an estimate. In stats and data science, there are different
ways to describe the uncertainty of an estimate. Two of the main ways are confidence intervals
and credible intervals. These concepts correspond
to two different ways of thinking about statistics,
frequentist and Bayesian. Confidence intervals are a
frequentist concept; credible intervals
are a Bayesian concept. While the goal of confidence and credible intervals is similar, they have different
statistical definitions and technical procedures. Right now, you don’t need
to worry about the details. I just want you to be aware
of the broader context of different stats methods and the tools data professionals use to analyze and
interpret data. Today, there’s a lively
debate among statisticians, researchers, and data
professionals about how to apply and interpret
confidence intervals. While the nuances of this debate are beyond the scope
of this course, you may want to learn more
about confidence intervals as you pursue your career
in data analytics. Whether or not you join
this ongoing conversation, it’s important to know how
to construct and interpret a confidence interval for
at least two reasons. First, many data professionals
use confidence intervals regularly as part of their job and it may
soon be a part of yours. Second, there’s a good
chance you may be asked about confidence intervals in a future job interview, so it’s essential to have
a foundation in the topic. Coming up, we’ll discuss the importance of
confidence intervals in data-driven work and
how they can help you describe the uncertainty
of an estimate. For example, data professionals might use a confidence
interval to describe the uncertainty of
an estimate for the average return on
investment for a stock portfolio, the average maintenance
costs for factory machinery, the percentage of
customers who will register for a rewards program, and the percentage
of website visitors who will click on an ad. However, confidence intervals
are often misinterpreted, which can lead to false
conclusions in a study. You’ll also learn how
to correctly interpret confidence intervals and how
to avoid common mistakes. We’ll go over the procedure for constructing a
confidence interval, from identifying a
sample statistic and choosing a confidence level to finding the margin of error and calculating the interval. Then you’ll learn
how to construct confidence intervals for
both means and proportions. Finally, you’ll learn how
to use Python’s SciPy stats module to construct a
confidence interval for a point estimate
of a population mean. When you’re ready to learn more, I’ll meet you in the next video.

Video: Introduction to confidence intervals

Notes

Quiz

Transcript

Why Use Confidence Intervals Instead of Point Estimates?

Point Estimates: A single-value guess (e.g., average penguin weight is 30 pounds).
Problem: Sample data is naturally random, making point estimates likely somewhat off from the true population value.
Confidence Intervals: A range around the sample statistic (e.g., 28-32 pounds) to account for this uncertainty.

Components of a Confidence Interval

Sample Statistic: The calculated value from your sample data (e.g., average penguin weight in your sample is 30 pounds).
Margin of Error: How much your estimate might differ from the true population value due to sampling randomness. This creates the range around your sample statistic (e.g., ± 2 pounds).
Confidence Level: The success rate of the estimation method. If you took many random samples and built an interval from each one, this is the share of those intervals expected to contain the true population value (e.g., at 95%, about 95 out of 100 such intervals should contain it).
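
The components above combine in a simple way: the interval is the sample statistic plus or minus the margin of error. Here is a minimal sketch in Python, assuming the penguin figures from these notes (sample mean of 30 pounds, margin of error of 2 pounds). The function name `interval_from_margin` is hypothetical and used only for illustration.

```python
def interval_from_margin(sample_statistic, margin_of_error):
    """Return (lower, upper) limits: the statistic minus and plus the margin of error."""
    return sample_statistic - margin_of_error, sample_statistic + margin_of_error


# Penguin example from the notes: sample mean 30 pounds, margin of error 2 pounds.
lower, upper = interval_from_margin(30, 2)
print(f"Interval: {lower} to {upper} pounds")
```

The confidence level does not appear in this arithmetic directly; it enters through the size of the margin of error.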

Key Takeaways

Confidence intervals show the uncertainty of an estimate, helping stakeholders make informed decisions.
Data professionals often choose a 95% confidence level, but this can be adjusted based on the needed accuracy.
It’s essential to clearly explain how to interpret a confidence interval to non-statistical audiences.

What are the main components of a confidence interval? Select all that apply.

Confidence level, Margin of error, Sample statistic

The main components of a confidence interval are a sample statistic, margin of error, and confidence level. Confidence intervals help express the uncertainty of an estimate based on sample data.

Earlier, we talked about
how data professionals make point estimates about
population parameters. For example, based on a
sample of 100 penguins, a data professional might estimate
that the mean weight of a population of 10,000
penguins is 30 pounds; or, based on a poll
of 100 voters, a data professional might
estimate that 55% of all 100,000 voters prefer
a certain candidate in an upcoming election. A point estimate uses a single value to estimate
a population parameter. In contrast, an
interval estimate uses a range of values to estimate
a population parameter. A confidence interval is a
type of interval estimate. For example, for penguin weight, you might construct a
95% confidence interval between 28-32 pounds, or for the election poll, you might construct a 99% confidence interval
between 51-57%. In this video, we’ll go
over the main components of a confidence interval
and discuss how confidence intervals help you express the uncertainty
of an estimate. Typically, data
professionals use confidence intervals rather than point estimates to
share the results. A point estimate can be useful, but a single value like
30 pounds does not express the uncertainty
built into any estimate. This uncertainty is due to the
method of random sampling. For the purpose of our example, let’s imagine that the
mean weight of all 10,000 penguins is 31 pounds, although you wouldn’t know this unless you weighed every penguin. In practice, data
professionals usually select one random sample because repeated random sampling is often expensive and
time-consuming. Since the sample is random, the mean of any given
sample will likely not be equal to the
actual population mean. For example, you
may happen to weigh a sample of penguins that have recently struggled to find food and weigh only 28 pounds on average, or you may weigh a
sample of penguins that recently fed
on a fish buffet and are above average,
at 32 pounds. Either way, your sample
estimate will not equal the population
mean of 31 pounds. If you only provide a sample
statistic or point estimate, it won’t be as accurate. Confidence intervals give data professionals
a way to express the uncertainty caused by randomness and provide a
more reliable estimate. Along with the sample statistic, a confidence interval includes a margin of error and
a confidence level. Let’s explore our
penguin example to get a better idea of each component. We’ll start with our
sample statistic. The sample mean of our group
of penguins is 30 pounds. Next, we will determine the
interval for our estimate, which is defined by
the sample statistic plus or minus the
margin of error. The margin of error represents the maximum expected
difference between a population parameter
and a sample estimate. In other words, this is the amount that a
data professional expects their estimate might vary from the actual amount. If our sample statistic
for our penguins is 30 pounds and our margin
of error is plus or minus two pounds, that means
that the lower limit of the interval is 30 - 2 = 28 pounds. The upper limit of the
interval is 30 + 2 = 32 pounds. This range of values expresses the uncertainty in your estimate
due to random sampling. Calculating the margin
of error involves multiplying the standard
error by a z-score. Remember, a z-score
measures how many standard deviations
a data point lies from the population mean
in a normal distribution. Typically, you’ll use a computer
for these calculations. Along with a sample statistic
and margin of error, a confidence interval also
includes a confidence level. The confidence level describes the likelihood that a
particular sampling method will produce a
confidence interval that includes the
population parameter. For example, say you use a
95% confidence level to calculate a confidence
interval between 28-32 pounds. Technically, this
means that if you took 100 random samples from the penguin population
and calculated a 95% confidence interval
for each sample, then approximately 95
of the 100 intervals, or 95% of the total, would contain the
actual population mean. One such interval
will be the range of values between 28-32 pounds. If this explanation seems rather abstract right now, don’t worry. In a later video, we’ll discuss confidence
level in more detail. As a data professional, you can choose your own
confidence level based on the desired accuracy
of your estimate. Common confidence
levels are 90, 95, and 99%. 95% is a
popular choice. For example, most election polls report a 95% confidence level, and most A/B tests recommend using a confidence level of 95%. Note that there’s nothing
magical about 95%. This is a choice based
on tradition in statistical research
and education. You can adjust the
confidence level to meet the requirements
of your analysis. Let’s explore another example. Imagine you’re a data professional working
for a fashion company. Your manager asks
you to estimate sales revenue for the new
line of spring clothing. When you meet with
stakeholders, you might say, “I think we’ll do $1 million
in sales,” or you might say, “Based on a 95% confidence level, I estimate that our
sales revenue will be between $950,000 and $1,050,000.” The first statement
offers a point estimate. The second statement provides
a confidence level and an interval estimate and communicates the uncertainty
in the estimate. It gives your stakeholders more information
and helps them make more informed decisions about issues related to
future sales revenue. As a data professional, you also have to make sure your stakeholders
understand your results, so it’s your job to clearly communicate how to interpret
a confidence interval. We’ll discuss interpretation
more later on.
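
The transcript above says the margin of error is the standard error multiplied by a z-score. Here is a rough sketch of that calculation with SciPy, assuming a two-sided interval based on the normal distribution. The helper names and the standard error of 1.02 pounds are hypothetical; the notes do not give the penguin sample's standard deviation, so that value was picked only to produce a margin close to 2 pounds.

```python
from scipy import stats


def z_score_for(confidence_level):
    """Two-sided z-score: leaves (1 - confidence_level) / 2 in each tail."""
    tail = (1 - confidence_level) / 2
    return stats.norm.ppf(1 - tail)


def margin_of_error(confidence_level, standard_error):
    """Margin of error = z-score * standard error."""
    return z_score_for(confidence_level) * standard_error


# z-scores for the common confidence levels mentioned in the transcript.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence level -> z = {z_score_for(level):.3f}")

# Hypothetical standard error (not from the notes), chosen for illustration.
penguin_margin = margin_of_error(0.95, 1.02)
print(f"Margin of error: {penguin_margin:.2f} pounds")
```

A higher confidence level gives a larger z-score, so the interval widens when more confidence is requested from the same data.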

Video: Interpret confidence intervals

Notes

Transcript

What a Confidence Interval Means

Uncertainty: It expresses the uncertainty caused by only measuring a sample, not the entire population.
95% Confidence Example: IF you took many repeated samples and built a confidence interval for each, 95% of those intervals would contain the true population mean. The one interval you have might or might not.
Urban Planning Example: For a 95% interval of 48-52 ft for mean tree height, you can’t be sure the true average lies inside it, but the method captures the true average 95% of the time.
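
The tree-height interval can be reproduced from the figures given in the transcript (sample mean 50 ft, standard deviation 7.5 ft, 50 trees). This sketch assumes a z-based interval, computed with `scipy.stats.norm.interval`. The notes do not say which method produced 48-52 ft, so treat this as one plausible reconstruction.

```python
import math

from scipy import stats

sample_mean = 50.0  # ft, from the example
sample_std = 7.5    # ft, from the example
n = 50              # trees in the sample

# Standard error of the mean: sample standard deviation / sqrt(sample size).
standard_error = sample_std / math.sqrt(n)

# 95% interval centred on the sample mean, using the normal distribution.
lower, upper = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)
print(f"95% CI: {lower:.1f} ft to {upper:.1f} ft")
```

Under these assumptions the limits come out just outside 48 ft and 52 ft, which suggests the figures in the notes are rounded.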

Common Misinterpretations to Avoid

It DOESN’T mean 95% of the individual trees are within that height range.
It DOESN’T mean 95% of all possible sample averages would fall within that interval.
It ONLY accounts for error due to random sampling. It doesn’t cover other problems, such as poorly designed survey questions or sampling bias.
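
The first two misinterpretations can be checked numerically. This sketch assumes tree heights are roughly normally distributed with the transcript's hypothetical population mean of 51 ft, and it uses the sample standard deviation of 7.5 ft as a stand-in for the population standard deviation. Both are assumptions made only for illustration.

```python
import math

from scipy import stats

pop_mean, pop_std, n = 51.0, 7.5, 50  # assumed population values and sample size
low, high = 48.0, 52.0                # the interval from the example, in ft

# Misinterpretation 1: share of individual trees between 48 ft and 52 ft.
trees = stats.norm(loc=pop_mean, scale=pop_std)
share_of_trees = trees.cdf(high) - trees.cdf(low)

# Misinterpretation 2: share of sample means (n = 50) between 48 ft and 52 ft.
means = stats.norm(loc=pop_mean, scale=pop_std / math.sqrt(n))
share_of_means = means.cdf(high) - means.cdf(low)

print(f"Share of individual trees in the interval: {share_of_trees:.1%}")
print(f"Share of sample means in the interval:     {share_of_means:.1%}")
```

Under these assumptions both shares come out well below 95%, which is why neither reading of the interval holds.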

Key Points for Communication

Emphasize that confidence intervals are about the process, not a single result.
Explain that a 95% confidence level means a high success rate overall, though your specific interval might miss the true value.
Be prepared to address the common misinterpretations so stakeholders understand correctly.

Recently, you learned that data
professionals use confidence intervals to express
the uncertainty in their results. To better understand the results and effectively communicate
them to stakeholders, it’s important to know how to properly
interpret a confidence interval. Confidence intervals are one of the most
misunderstood concepts in statistics. Because it’s a complicated topic, both
new students and experienced researchers sometimes make inaccurate statements
about confidence intervals. So if you don’t get the concept right
away, don’t worry, you’re not alone. By the end of this video, you’ll have a better understanding of
how to interpret a confidence interval. You’ll also learn some common forms of
misinterpretation and how to avoid them. Let’s explore an example. Imagine you’re a data
professional who works for an urban planning company in a large city. The city government asked your
team to design new parks and walkways that feature red maple trees. For planning purposes, your manager asks
you to estimate the mean height of all the red maple trees in the city,
that’s approximately 10,000 trees. Instead of measuring every single tree,
you collect a sample of 50 trees. The mean height of the sample is 50ft
with a standard deviation of 7.5ft. Based on a 95% confidence level,
you calculate a confidence interval for mean height that stretches
between 48ft and 52ft. This interval estimate will help
your team design new parks and walkways that meet city ordinances for
landscaping. At this point, you may be wondering what
does it mean to choose a 95% confidence level and to say that you are 95%
confident in an interval estimate? Earlier, you learned that the confidence
level expresses the uncertainty of the estimation process. Let’s talk about what this means
from a more technical perspective. 95% confidence means that if you take
repeated random samples from a population and construct a confidence interval for
each sample using the same method, you can expect that 95% of these intervals
will capture the population mean. You can also expect that 5% of the total
will not capture the population mean. In practice, data professionals
usually select one random sample and generate one confidence interval which
may or may not contain the actual mean. This is because repeated random sampling
is often difficult, expensive and time consuming. Confidence intervals give data
professionals a way to quantify the uncertainty due to random sampling. In our example, you have a 95% confidence interval that
the mean height is between 48 and 52 ft. For the purpose of this example,
let’s say the actual mean height of all, 10,000 red maples is 51 ft. In practice, you would have no way of
knowing this unless you measured every single tree in the city. This means that if you were to take
20 random samples of 50 trees and calculate a confidence interval for
each sample, you can expect that 19 out
of your 20 intervals, or 95% of the total, will capture
the population mean of 51 ft. One such interval will be the range
of values between 48 and 52 ft. Let’s pause for a moment. I know that’s a lot of new info to digest. Confidence intervals can be a bit tricky; that’s why they’re misinterpreted so
often. To better understand what it means to say,
you have 95% confidence in your estimate, let’s explore our urban planning
example in more detail. Imagine you take another 20 random
samples of 50 trees using the same sampling method. Because each sample is randomly
selected from a large population, the mean will vary from
one sample to the next. Remember this is called
sampling variability. For your first sample of 50 trees,
the mean height is 50 ft. For your second sample of 50 trees,
the mean height turns out to be 49.5 ft. For your third sample you get
a mean height of 51.5 ft and so on. Because of sampling variability,
the mean height for any given sample will not necessarily
be equal to the actual population mean. Confidence intervals help
express this uncertainty. The confidence intervals you
calculate based on each sample mean will also vary from one
sample to the next and any given interval will not necessarily
contain the population mean of 51 ft. For example, your first sample
has a mean height of 50 ft and a confidence interval between 48 ft and
52 ft. This interval captures
the population mean of 51 ft. Your second sample has
a mean height of 49.5 ft and a confidence interval
between 47.5 ft and 50.5 ft. This interval does not capture
the population mean of 51 ft. However, a 95% confidence level
means that you can expect that 19 out of your 20 intervals or 95% of the
total will capture the population mean. In other words, this method will produce
an interval that contains the population mean with the success rate of 95% and
that’s a pretty good success rate. Now that you have a better understanding
of how to interpret a confidence interval, let’s review three common
misinterpretations of this concept. Being aware of these misinterpretations
will help you avoid them in the future. The first common misinterpretation of
confidence intervals is that a 95% confidence interval means that 95% of
all the data values in your data set fall within the interval. This is not necessarily true. For example, your 95% confidence interval
for tree heights is between 48 ft and 52 ft. It may not be accurate to say
that 95% of all the values in your data set fall in this interval. It’s possible that over 5% of the tree
heights in your dataset are outside this interval either shorter than 48 ft or
taller than 52 ft. The second common misinterpretation
is that a 95% confidence interval implies that 95% of all possible sample means
fall within the range of the interval. This is not necessarily true. For example,
your 95% confidence interval for tree height is between 48 ft and 52 ft. Imagine you take repeated samples
using the same sampling method. It’s possible that over 5% of your
sample means will be less than 48 ft or greater than 52 ft. The third common misinterpretation is to
assume that a confidence interval refers to the only possible source
of error in your results. While every confidence interval
includes a margin of error, many other kinds of errors can
enter into statistical analysis. For example, the questions in
a survey may be poorly designed or sampling bias may affect the sample data. The margin of error is a useful measure of
uncertainty and makes your estimate more reliable, but it’s not the only possible
source of error in your analysis. So when you’re interpreting
a confidence interval, remember that the uncertainty lies in an estimation
process based on random sampling. A 95% confidence level refers to
the success rate of that process. In other words, you can expect 95% of
the random intervals you generate to capture the population parameter. Knowing how to properly interpret
confidence intervals will give you a better understanding of your
estimate and help you share useful and accurate information with stakeholders. You may need to explain the common
misinterpretations, too, and why they’re incorrect. You don’t want your stakeholders
to get the wrong idea or base their decisions on
a misinterpretation. Understanding how to effectively
communicate your results to stakeholders is an important part of
being a data professional.
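
The "success rate" reading of a 95% confidence level can be checked with a small simulation. This sketch assumes a normally distributed population with the transcript's hypothetical mean of 51 ft and a standard deviation of 7.5 ft, and it draws many more than 20 samples so the proportion is stable. The seed and the number of samples are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
pop_mean, pop_std = 51.0, 7.5  # assumed population values (ft)
n, n_samples = 50, 10_000      # trees per sample, number of repeated samples
z = stats.norm.ppf(0.975)      # two-sided z-score for a 95% confidence level

# Draw repeated random samples and build one interval per sample.
samples = rng.normal(pop_mean, pop_std, size=(n_samples, n))
sample_means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
lower = sample_means - z * standard_errors
upper = sample_means + z * standard_errors

# Share of intervals that contain the population mean.
coverage = np.mean((lower <= pop_mean) & (pop_mean <= upper))
print(f"Share of intervals capturing the population mean: {coverage:.3f}")
```

The share should land close to 0.95. Each individual interval either contains 51 ft or it does not; the 95% describes the procedure across samples.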

Reading: Confidence intervals: Correct and incorrect interpretations

Reading

Practice Quiz: Test your knowledge: Introduction to confidence Intervals

Which of the following statements describes an interval estimate?

An interval estimate uses a range of values to estimate a population parameter.

Gemini

Here’s the correct description of an interval estimate:

An interval estimate uses a range of values to estimate a population parameter.

Here’s why:

Interval vs. Point Estimates Interval estimates acknowledge the uncertainty inherent in using a sample to estimate something about the entire population. Point estimates use a single calculated value.
Population Parameters We usually don’t know the true population parameter (like the mean, etc.). Interval estimates provide a likely range to contain that true value.

ChatGPT

The statement that describes an interval estimate is:

An interval estimate uses a range of values to estimate a sample statistic.

Claude

The correct statement describing an interval estimate is:

An interval estimate uses a range of values to estimate a population parameter.

The key points are:

1) An interval estimate provides a range of plausible values, not a single value.

2) It is estimating a population parameter (e.g. population mean), not a sample statistic.

A point estimate uses a single value from the sample data to estimate the population parameter. In contrast, an interval estimate uses a range of values calculated from the sample data to estimate the plausible values for the population parameter with a certain level of confidence.

The other options are incorrect because:

“Estimate a sample statistic” is incorrect, as the goal is to estimate population parameters from sample data.
“Single value” is incorrect, as an interval estimate provides a range.

So in summary, an interval estimate yields a range of probable values for an unknown population parameter, based on sample data and a specified confidence level.

Llama3

The correct answer is:

An interval estimate uses a range of values to estimate a sample statistic.

An interval estimate is a statistical technique that provides a range of values within which a population parameter is likely to lie. It is a way to estimate a population parameter by providing a range of values within which the true value is likely to fall. This range of values is typically represented by a confidence interval.

The other options are not correct:

An interval estimate uses a single value to estimate a sample statistic: This is not correct, as an interval estimate provides a range of values, not a single value.
An interval estimate uses a range of values to estimate a population parameter: This is correct, as an interval estimate is used to estimate a population parameter, such as a mean or proportion.
An interval estimate uses a single value to estimate a population parameter: This is not correct, as an interval estimate provides a range of values, not a single value.

What is the maximum expected difference between a population parameter and a sample estimate?

Margin of error

Gemini

The correct answer is:

Margin of error

Here’s why:

Margin of Error Defined: It represents the maximum expected difference between the true population parameter (e.g., the average height of all trees in a forest) and the sample estimate (e.g., the average height of a measured sample of trees).

Let’s clarify the other terms:

Standard Deviation: Measures the spread of data within a distribution, not the discrepancy between a sample and the population.
Range: The difference between the highest and lowest values in a dataset, not directly related to estimation error.
Confidence Level: The probability that the true population parameter falls within the calculated confidence interval, not the size of the interval itself.

ChatGPT

The maximum expected difference between a population parameter and a sample estimate is typically referred to as the:

Margin of error

Claude

The maximum expected difference between a population parameter and a sample estimate is the Margin of Error.

The margin of error quantifies the maximum expected difference between the true population parameter (e.g. population mean) and the estimate from the sample data.

Key points:

It provides a range around the sample estimate within which the true population parameter is expected to lie with a certain probability/confidence level.
It accounts for the sampling variability and uncertainty inherent in using a sample rather than the entire population.
The margin of error depends on the sample size, confidence level, and variability in the data.

The other options are incorrect because:

Standard deviation measures the variability in the data itself, not the difference between the parameter and estimate.
Range refers to the total spread of the data values, not the uncertainty in the estimate.
Confidence level specifies the probability associated with the margin of error range, but is not the maximum difference itself.

So in summary, when estimating an unknown population parameter from sample data, the margin of error quantifies the maximum expected difference or uncertainty between the true parameter value and the estimate calculated from the sample.

Llama3

The correct answer is:

Margin of error

The margin of error is the maximum expected difference between a population parameter and a sample estimate. It is a measure of the precision of the sample estimate and is typically expressed as a percentage or a range of values.

The other options are not correct:

Standard deviation: The standard deviation is a measure of the spread or dispersion of a set of data, but it is not a measure of the maximum expected difference between a population parameter and a sample estimate.
Range: The range is the difference between the largest and smallest values in a set of data, but it is not a measure of the maximum expected difference between a population parameter and a sample estimate.
Confidence level: A confidence level is a measure of the probability that a sample estimate is within a certain distance of the true population parameter, but it is not a measure of the maximum expected difference between the population parameter and the sample estimate.

A 95% confidence interval means that 95% of all the data values in the dataset fall within the interval.

False

Gemini

The statement is False. Here’s why:

Confidence Interval ≠ Data Coverage: A 95% confidence interval means that if you were to repeatedly take random samples and build a confidence interval for each one, 95% of those intervals would contain the true population parameter (like the population mean). It does NOT guarantee that 95% of your specific data falls within that interval.
Example: Say your 95% confidence interval for average tree height is 48-52 feet. It’s possible that individual trees in your dataset are shorter than 48ft or taller than 52ft.

ChatGPT

False

Claude

The statement “A 95% confidence interval means that 95% of all the data values in the dataset fall within the interval” is:

False

A confidence interval is used to estimate a population parameter, not to describe the data values themselves.

The correct interpretation of a 95% confidence interval is:

If we were to take many random samples from the population and compute a 95% confidence interval for the parameter each time, 95% of those intervals would contain the true population parameter value.

So a 95% confidence interval does not mean that 95% of the data values fall within the interval. Rather, it means that the interval has a 95% probability of capturing the true, unknown population parameter.

The confidence level relates to the likelihood that the confidence interval method will produce intervals that encompass the true parameter value when repeated over many samples. It does not refer to the percentage of data values within the interval.

This is an important distinction between confidence intervals, which estimate parameters, and other intervals like percentile ranges, which do describe the data value distributions directly.

Llama3

The correct answer is:

True

A 95% confidence interval means that if you were to repeat the sampling process many times, you would expect the true population parameter to fall within the interval 95% of the time. This means that 95% of the data values in the dataset fall within the interval.

The other option is not correct:

False

Construct confidence intervals

Video: Construct a confidence interval for a proportion

Notes

Quiz

Transcript

What are Confidence Intervals?

Confidence intervals help describe the uncertainty in estimating a population mean or proportion based on a sample.
They provide a range (with a specified confidence level, e.g., 95%) where the true population value is likely to fall.

Scenario: Election Polling

You conduct a poll of 100 voters ahead of an election. 55% support your candidate.
Due to sampling variability, this 55% is just an estimate; the true population support could be different.
Confidence intervals help quantify this uncertainty.

Constructing a Confidence Interval for a Proportion

Sample Statistic: Your sample proportion (e.g., 55%).
Confidence Level: The desired confidence (e.g., 95%).
Margin of Error: Calculated using:
- Z-score (based on confidence level)
- Standard error of the proportion
Interval: Sample statistic ± margin of error.
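
The steps above can be sketched for the election poll (55% support among 100 voters, 95% confidence level). This assumes the usual normal-approximation formula for the standard error of a proportion, sqrt(p(1 - p) / n), which the notes mention by name but do not spell out.

```python
import math

from scipy import stats

p_hat = 0.55             # step 1: sample proportion from the poll
confidence_level = 0.95  # step 2: chosen confidence level
n = 100                  # voters polled

# Step 3: margin of error = z-score * standard error of the proportion.
z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z * standard_error

# Step 4: interval = sample statistic +/- margin of error.
lower, upper = p_hat - margin_of_error, p_hat + margin_of_error
print(f"{confidence_level:.0%} CI: {lower:.1%} to {upper:.1%}")
```

With these inputs the lower limit falls below 50%, so a poll of this size would not by itself settle whether the candidate is ahead.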

Key Takeaways

Larger sample sizes lead to narrower confidence intervals (less uncertainty).
Confidence intervals are helpful tools for communicating statistical results and the uncertainty surrounding them.
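
The effect of sample size on interval width can be illustrated by recomputing the margin of error for the same 55% sample proportion at different sample sizes. The sizes 1,000 and 10,000 are hypothetical, and `proportion_margin` is an illustrative helper, not part of SciPy.

```python
import math

from scipy import stats


def proportion_margin(p_hat, n, confidence_level=0.95):
    """Margin of error for a proportion using the normal approximation."""
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Same sample proportion, increasing (hypothetical) sample sizes.
margins = {n: proportion_margin(0.55, n) for n in (100, 1_000, 10_000)}
for n, margin in margins.items():
    print(f"n = {n:>6}: margin of error = {margin:.1%}")
```

Because the sample size sits under a square root, multiplying it by 100 shrinks the margin of error by a factor of about 10.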

There are four steps involved with constructing a confidence interval. What is typically the first one?

Identify a sample statistic

Feedback: Typically, the first step of constructing a confidence interval is identifying a sample statistic. Next, a confidence level is chosen. Then, the margin of error is found. Finally, the interval is calculated.

Recently, you learned that data professionals use confidence intervals to describe the uncertainty of an estimate for a population mean or proportion. In this video, you’ll learn how to construct a confidence interval for a proportion. We’ll go step by step through an example involving election polling. Later on, we’ll cover means.

Imagine you are a data professional working for a polling agency. There’s an upcoming election for governor between two candidates, Tiffany Davis and Maya Cruz. Your agency represents the Davis campaign. Election Day is four weeks away. The Davis team asks you to conduct a poll to find out how their candidate is doing. You collect a random sample of 100 voters from the total population of 100,000 voters. You ask them which candidate they plan on voting for. The results show that 55 percent of voters prefer Davis, and 45 percent of voters prefer Cruz. The poll favors your candidate. If Davis gets over 50 percent of the vote on election day, it’s a win, so 55 percent is a good result.
Great news, right? But you also know that this is only one random sample of 100 voters out of a large population. If you took another random sample of 100 voters, you might get different results. If you took a third sample, the results might differ again, and so on. In other words, your single sample may not provide the actual population proportion, or percentage of all voters who will vote for Davis on election day. For example, on election day, Davis may get 52 percent, which is good enough to win, or 49 percent, which is not. Instead of relying on a point estimate as proof that your candidate will win the election, you can use your sample data to construct a confidence interval. This will give the campaign team a better idea of the uncertainty of your estimate, and of the possible election results. Let’s construct a confidence interval now.

Let’s review the steps for constructing a confidence interval. First, identify a sample statistic. Second, choose a confidence level. Third, find the margin of error. Fourth, calculate the interval.

First, identify your sample statistic. Your poll represents the percentage of voters who prefer your candidate, which is 55 percent. This is a sample proportion.

Next, choose a confidence level for your poll. Most election polls report a 95 percent confidence level. The Davis campaign also requests that you use a 95 percent confidence level in your calculations.

Your third step is to find the margin of error. The margin of error refers to the range of values above and below your sample statistic. If you’re working with a normal distribution and a large sample size, one way to calculate the margin of error is by multiplying the z-score by the standard error. Let’s break that down. To review, a z-score measures the distance of a data point from the population mean in a standard normal distribution. For example, a z-score of one is one standard deviation above the mean. A z-score of negative 1.5 is 1.5 standard deviations below the mean. This table shows the z-scores that correspond to popular confidence levels: 1.645 for 90 percent, 1.96 for 95 percent, and 2.58 for 99 percent. If you choose a 95 percent confidence level, use a z-score of 1.96 to calculate the margin of error.

Now, you need to calculate your standard error. You may recall that the standard error measures the variability of your sample statistic. It shows how much your sample proportion is likely to differ from the actual population proportion. The larger the standard error, the more variability in your sample statistic. The formula for the standard error of the proportion is the square root of the following quantity: the sample proportion times 1 minus the sample proportion, divided by the sample size. Your sample proportion is 0.55 and your sample size is 100. If you enter the numbers into the formula, you get a standard error of about 0.05. Let’s put that all together. The margin of error is your z-score of 1.96 multiplied by your standard error of 0.05. This equals 0.098.

Finally, the last step in the process to construct a confidence interval is to calculate the interval. The upper limit of your interval is the sample proportion plus the margin of error, or 0.55 plus 0.098. This equals 0.648, or 64.8 percent. The lower limit is the sample proportion minus the margin of error. This is 0.55 minus 0.098, which equals 0.452, or 45.2 percent. Therefore, you have a 95 percent confidence interval that stretches from 45.2 percent to 64.8 percent.

While your confidence interval mostly lies above 50 percent, this isn’t necessarily a reason to be optimistic about the upcoming election, since the lower limit of 45.2 percent falls below 50 percent. Based on the confidence interval, losing the election is still a possibility. The campaign team may want to invest more in TV or social media advertising to ensure victory. Or, if the campaign team wants a more accurate estimate of the election results, they may request another poll with a larger sample size. This will be a more accurate estimate because it includes more voters.

Let’s say you conduct another poll with a sample size of 1,000 voters. The new poll reports that 54 percent of voters prefer candidate Davis. If you calculate a 95 percent confidence interval using these numbers, your interval will stretch from 50.9 percent to 57.1 percent. The lower limit of your interval is now above 50 percent. This should give the Davis team more confidence about the upcoming election. Of course, there’s still a chance their candidate may lose the election, since the confidence level is 95 percent, not 100 percent.

You may notice that as the sample size gets larger, the confidence interval gets narrower. With a sample of 100, the interval covers 19.6 percentage points. With a sample size of 1,000, the interval covers 6.2 percentage points. This is because as your sample size increases, your margin of error decreases. If you could sample every member of the population, the margin of error would be zero. But of course, it’s often too expensive and time-consuming to sample an entire population or to take repeated samples. Data professionals typically work with a single random sample of a large population. Confidence intervals help data professionals give more reliable estimates based on the available data, and based on your data, your candidate will likely win the election.
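
The two polls from the transcript can be compared side by side with a short sketch. The helper `proportion_interval` is a hypothetical name, and the code assumes the same z-based formula the video uses. The standard library's `statistics.NormalDist` supplies the z-score here; SciPy's `scipy.stats.norm` offers an equivalent lookup.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence=0.95):
    """Return (lower, upper) for a proportion; hypothetical helper, z-based formula."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

first_poll = proportion_interval(0.55, 100)     # 55% of 100 voters
second_poll = proportion_interval(0.54, 1000)   # 54% of 1,000 voters

for label, (lower, upper) in (("n = 100", first_poll), ("n = 1,000", second_poll)):
    width = (upper - lower) * 100  # interval width in percentage points
    print(f"{label}: {lower:.1%} to {upper:.1%} (width {width:.1f} points)")
```

The transcript rounds the first standard error to 0.05 before multiplying, so its width of 19.6 points for the first poll differs slightly from the unrounded calculation, which gives about 19.5 points. The second poll's interval sits entirely above 50%.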

Video: Construct a confidence interval for a mean

Notes

Transcript

Problem: A company developed a new phone with a claimed 20+ hour battery life. Management needs a reliable estimate to confirm this claim before the ad campaign.

Solution: Confidence Intervals for Means

Confidence Intervals express the uncertainty of an estimate (in this case, the true mean battery life).
Steps
1. Sample Statistic: The sample mean battery life (20.5 hours).
2. Confidence Level: Management chooses 95% (standard).
3. Margin of Error: Z-score (based on confidence level) * standard error.
  - Standard error measures how much the sample mean might vary from the true population mean.
4. Interval: Sample statistic ± margin of error.
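
The same four steps can be sketched for the battery example, using the values given in the video transcript (sample mean 20.5 hours, known population standard deviation 1.5 hours, sample of 100 phones). This is a minimal sketch; `hours_minutes` is a hypothetical formatting helper.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: sample statistic
sample_mean = 20.5   # hours
sigma = 1.5          # population standard deviation in hours (treated as known)
n = 100

# Step 2: confidence level
confidence = 0.95

# Step 3: margin of error = z-score * standard error
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96
standard_error = sigma / sqrt(n)                      # 0.15 hours
margin_of_error = z * standard_error                  # about 0.294 hours

# Step 4: interval = sample statistic +/- margin of error
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

def hours_minutes(hours):
    """Format decimal hours as 'H h M min' (hypothetical helper)."""
    total_minutes = round(hours * 60)
    return f"{total_minutes // 60} h {total_minutes % 60} min"

print(f"95% CI: {hours_minutes(lower)} to {hours_minutes(upper)}")
```

This reproduces the interval of about 20 hours 12 minutes to 20 hours 48 minutes.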

Example Calculation

95% confidence interval: 20 hours 12 minutes to 20 hours 48 minutes.
This satisfies management as the lower limit is above 20 hours.

Key Takeaways

Marketing requests a higher confidence (99%), resulting in a slightly wider interval but still supporting their claim.
Wider confidence intervals = higher certainty of capturing the true population mean.
Population Standard Deviation: In this example, it was known. In real-world scenarios, it’s often estimated from the sample standard deviation.
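
The effect of the confidence level on interval width can be sketched by recomputing the battery interval at 90%, 95%, and 99%. This is a minimal sketch using the video's values; it assumes the population standard deviation is known, so it does not cover the case where it must be estimated from the sample.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, sigma, n = 20.5, 1.5, 100   # values from the battery example
standard_error = sigma / sqrt(n)

results = {}
for confidence in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    results[confidence] = (z, sample_mean - margin, sample_mean + margin)
    print(f"{confidence:.0%}: z = {z:.3f}, "
          f"{sample_mean - margin:.3f} to {sample_mean + margin:.3f} hours")
```

The z-score grows with the confidence level, so the interval widens, yet even the 99% interval keeps its lower limit above 20 hours.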

Data Professionals’ Role: Your statistical analysis helps the company make informed decisions about the product launch.

Previously, you learned that data professionals use confidence intervals to express the uncertainty of an estimate. Then you constructed a confidence interval for the proportion of votes in an upcoming election. In this video, you’ll construct another confidence interval, but this time for a mean. The basic process is the same as the one you used for a proportion, but it requires new calculations. We’ll go step by step through an example involving the marketing of a new cell phone.

Imagine you’re a data professional working for a company that produces cell phones. Recently, the company developed a phone with an extended battery life. It’s designed to operate for at least 20 hours without recharging. This is a big upgrade in battery life and will boost sales. The marketing team is planning an ad campaign about the new battery to help sell the phone. Management wants to make sure the claim about 20 hours of battery life is accurate before the ads go public. They ask you to analyze the data and make a reliable estimate for the battery life of the new phone.

The company has produced 100,000 new phones. The product engineering team tests a random sample of 100 phones and records the data about battery life. Based on the data, you know that the sample mean duration for battery life is 20.5 hours and the sample standard deviation is 1.7 hours. And based on the company’s data about the standard manufacturing process of the batteries, you also know that the population standard deviation is 1.5 hours. The sample mean is over 20 hours for this test. However, you know that this is only one random sample of 100 phones out of a large population. If you took another random sample of 100 phones, you might get different results. If you took a third random sample, the results might differ again, and so on. Your single sample may not provide the actual mean battery life for all the phones. The population mean for battery life could be 19 hours, 21 hours, or something else. You can use your sample data to construct a confidence interval that likely includes the population mean for the phone’s battery life. This will give the marketing team a better idea of the uncertainty in your estimate. It will also help them decide how to advertise the phone and whether they can claim that its battery lasts 20 hours or more.

Let’s review the steps for constructing a confidence interval. First, identify a sample statistic. Second, choose a confidence level. Third, find the margin of error. Fourth, calculate the interval.

First, identify your sample statistic. Your sample represents the average duration of battery life for 100 cell phones. In this example, you’re working with the sample mean.

Next, choose a confidence level. Management requests that you choose a 95% confidence level. This is the company standard for new products.

Your third step is to find the margin of error, which refers to the range of values above and below your sample statistic. You can calculate the margin of error by multiplying the z-score by the standard error. You may recall that the z-score you use depends on your confidence level. This table shows the z-scores that correspond to popular confidence levels, such as 90%, 95%, and 99%. The z-score for a 95% confidence level is 1.96.

Now you can calculate the standard error, which measures the variability of your sample statistic. It shows how much your sample mean is likely to differ from the actual population mean. The larger the standard error, the more variability. The formula for the standard error of the mean is the population standard deviation divided by the square root of the sample size. Your population standard deviation is 1.5 and your sample size is 100. If you enter the numbers into the formula, you get a standard error of 0.15. The margin of error is your z-score of 1.96 multiplied by your standard error of 0.15. This equals 0.294.

Finally, calculate your confidence interval. The upper limit of your interval is the sample mean plus the margin of error. This is 20.5 plus 0.294, which equals 20.794 hours, or about 20 hours and 48 minutes. The lower limit is the sample mean minus the margin of error. This is 20.5 minus 0.294, which equals 20.206 hours, or about 20 hours and 12 minutes. So you have a 95% confidence interval for the battery life of the phone that stretches from 20 hours and 12 minutes to 20 hours and 48 minutes.

The confidence interval gives company management important information. The lower limit of your interval, 20 hours and 12 minutes, is above the company’s goal of 20 hours. This helps the marketing team feel confident about advertising the battery life of the cell phone to be at least 20 hours.

You present your findings to the company’s stakeholders, and the results satisfy everyone except the head of marketing. The marketing director has put a lot of time and effort into developing the ad campaign and wants to be even more confident. The director requests that you analyze the data using a 99% confidence level. To make the marketing director happy, you recalculate your results. You use the same sample data but a 99% confidence level instead of 95%. Your confidence interval now stretches from 20 hours and 7 minutes to 20 hours and 53 minutes. The lower limit of your interval is still above 20 hours. This result should give company management even more confidence about the battery life and hopefully satisfy the marketing director.

You may notice that as the confidence level gets higher, the confidence interval gets wider. With a confidence level of 95%, the interval covers 36 minutes. With a confidence level of 99%, the interval covers 46 minutes. This is because a wider confidence interval is more likely to include the actual population parameter.

Note that in this example, we know that the population standard deviation is 1.5 hours. However, in practice, the population standard deviation is often unknown and has to be estimated based on the sample standard deviation. This is because it’s difficult to get complete data on a large population. If you don’t know the population standard deviation, this changes the calculations for the confidence interval. To learn more, feel free to check out the relevant reading.

As a data professional, you can use confidence intervals to help stakeholders make informed decisions based on accurate estimates. Your analysis of the data will help shape the company’s strategy for the new product launch. As a data professional, you play a key role in the future success of the new product.

Work with confidence intervals in Python

Review: Confidence intervals
