
Data professionals use smaller samples of data to draw conclusions about large datasets. You’ll learn about the different methods they use to collect and analyze sample data and how they avoid sampling bias. You’ll also learn how sampling distributions can help you make accurate estimates.

Learning Objectives

  • Use Python for sampling
  • Explain the concept of standard error
  • Define the central limit theorem
  • Explain the concept of sampling distribution
  • Explain the concept of sampling bias
  • Describe the benefits and drawbacks of non-probability sampling methods such as convenience, voluntary response, snowball, and purposive
  • Describe the benefits and drawbacks of probability sampling methods such as simple random, stratified, cluster, and systematic
  • Explain the difference between probability sampling and non-probability sampling
  • Describe the main stages of the sampling process
  • Explain the concept of a representative sample

Introduction to sampling


Video: Welcome to module 3

Building on Your Knowledge

  • You’ve mastered descriptive statistics and basic probability – these form the foundation for more advanced techniques.
  • Data science is constantly evolving, so continuous learning is key!

Focus on Sampling

  • What is Sampling? Selecting a smaller subset (sample) from a larger group (population) to make inferences.
  • Why Sample? Studying entire populations can be impractical. A representative sample allows you to draw conclusions about the whole.

Topics We’ll Cover

  1. Review: Revisiting inferential statistics and what makes a sample representative.
  2. Sampling Process: Steps from defining your target population to collecting the sample data.
  3. Types of Sampling: Probability (random chance) vs. non-probability methods, with their pros and cons.
  4. Bias: Understanding how sampling methods can introduce bias (undercoverage, non-response, etc.)
  5. Sampling Distributions: How sample statistics relate to the population.
  6. Central Limit Theorem: A powerful tool for estimating population means.
  7. Python Skills: Using the SciPy stats module for sampling and analysis.

Hello again. It’s great to be back to continue
our learning journey together. You’ve learned a lot so far. You now have a better understanding of the
essential role statistics plays in data science. You also have a solid foundation for
using descriptive statistics and basic probability to describe,
analyze and interpret data. Your knowledge of the fundamental concepts
of statistics is the first step on a path that leads to more advanced methods
like hypothesis testing and regression analysis. What’s really exciting is that this
learning journey will continue throughout your future career as a data professional. The amount of data in the world
is always growing and the data career space is
constantly advancing. I often read about new machine learning
methods to keep up with the changes in the field and
develop new skills to use at work. The next stage of your journey
is all about sampling or the process of selecting a subset
of data from a population. For example, if you want to survey
a population of 100,000 people, you can select a representative
sample of 100 people. Then you can draw conclusions about
the population based on your sample data. Coming up, we'll go over how the sampling process
works and how data professionals use sample data to better understand larger
populations. Recall that in statistics, a population can refer to any type of
data, including people, objects, events, measurements, and more. We'll start with a review
of inferential statistics and examine the concept of
a representative sample. Next we'll go over the different stages
of the sampling process from choosing a target population to collecting data for
your sample. Then we’ll explore the two main types of
sampling methods: probability sampling and non-probability sampling. We'll discuss the benefits and
drawbacks of various sampling methods and describe how random sampling can help
ensure that your sample is representative of the population. We'll also
introduce different forms of bias in sampling, like undercoverage and
nonresponse bias, and how they can affect non-probability sampling methods. After that,
we’ll explore sampling distributions, which are probability distributions for
sample statistics. You'll learn about sampling distributions for
both sample means and proportions and how to estimate the corresponding
values for populations. We'll also cover the central
limit theorem and how it can help you estimate the population
mean for different types of datasets. Finally, you'll learn how to use
Python's SciPy stats module to work with sampling distributions and make
a point estimate of the population mean. When you're ready,
I'll meet you in the next video.

Video: Cliff: Value everyone’s contributions

Cliff’s Role at Google

  • Uses data to enhance employee well-being, productivity, and overall experience.
  • Contributes to HR strategy, hybrid work policies, and location planning.

Developing Confidence in Analytics

  • Embracing teamwork: Recognizes that collaboration and diverse skillsets are key, not needing to have all the answers individually.

Communication Strategies

  1. Understanding Business Goals: Initial meetings focus on partners’ broader objectives, not just project specifics. This helps situate your data insights within their larger success picture.
  2. Active Listening & Confirmation: Plays back his understanding of partners’ questions/goals to ensure alignment.
  3. Presenting Data-Driven Options: Instead of just offering answers, suggests various possibilities based on the data. This sparks dialogue and lets partners identify what resonates most.

Key Takeaway

Cliff emphasizes a balance between listening to partners and using your data expertise to collaboratively find the best solutions.

My name is Cliff and
I’m a workforce planning and people analytics lead at Google. I use data to help our employees be
more productive, more connected, and just overall improve their well-being. I also use data to improve
our HR practices and focus on our hybrid work policies
as well as our location strategy. I’ve always been interested in
issues of workforce development, people strategy, human resources. But I didn’t anticipate what
a central role analytics would play in my work and
how much I would come to love analytics. One of the things that has helped me
develop a confident voice in this space has just been understanding that we work
in teams, we work cross-functionally, and I don’t need to bring all of
the solutions to a problem, right? I’ll bring perspective on how we can
use data to solve this problem, but I’m working with people who
also bring a wealth of skills. And looking at it really
as a partnership where it’s really about leveraging the best
of everyone in the team, that’s helped me bring a lot
of confidence to the work. My go to strategy for communications when working with partners
is to first set up a few low-stakes meetings just to understand what
their broader business goals are. I’m not even thinking about the specific
project we’re working on, but more broadly, how do they define success? That helps me understand where the work
that we’re doing fits into the context of their bigger picture. The second thing I do from a communication
standpoint is to try to play back what I think I heard somebody say. Sort of to repeat it back to them, whether
it’s my understanding of their question or the output that they’re
trying to see from the data, just to test if I actually
really understand their goals. When I’m working with somebody and
I feel we’re not getting to the root of a question or a problem, what I find
is really helpful is laying out from a data perspective a set of different
options or possibilities for them. And engaging in a conversation around
which of those really resonate for them. And so, it's finding a balance
between listening and telling as a way to
unlock insights from the data that they might not have
thought about themselves.

Video: Introduction to sampling

Why Sampling Matters

  • Practicality: Analyzing an entire population is often impossible, especially with large datasets. Sampling saves time, money, and resources.
  • Example: Determining the percentage of laptop users in a city is much more feasible with a sample than surveying the entire population.

The Importance of Representative Samples

  • Accurate Inferences: A representative sample accurately reflects the characteristics of the larger population you’re interested in.
  • Bad Samples = Bad Insights: If your sample is biased (e.g., only includes computer scientists), your conclusions about the population will be unreliable.
  • Example: Using only the heights of professional basketball players wouldn’t give you an accurate picture of the average height within the full population.

Key Takeaways for Data Professionals

  • “Garbage in, garbage out”: A sophisticated model can’t fix the problems caused by a non-representative sample.
  • Focus on Sample Quality: Ensure that your sample is representative of the target population to make reliable inferences and predictions.

A representative sample does not reflect the characteristics of a population.

False

A representative sample accurately reflects the characteristics of a population. If a sample does not accurately reflect the characteristics of a population, then the inferences will likely be unreliable and predictions inaccurate. This can lead to negative outcomes for stakeholders and organizations.

Earlier in the course, we briefly
discussed the difference between descriptive and inferential statistics. Descriptive statistics like the mean and standard deviation summarize
the main features of the dataset. Inferential statistics use sample
data to draw conclusions or make predictions about
a larger population. Now we’re going to return to
inferential statistics and explore the relationship between
sample and population in more detail. This part of the course
is all about sampling, the process of drawing a subset
of data from a population. In this video, we’ll discuss how data professionals
use sampling in data science and the importance of working with the sample
that is representative of the population. Data professionals use sampling to
analyze many different types of data. Here are some questions that sampling
has helped my data science team answer: How many products in an app store do
we need to test to feel confident that all the products
are secure from malware? How do we select a sample of users
to run an effective A/B test for an online retail store? And how do we select a sample of customers
of a video streaming service to get reliable feedback on the shows they watch? Sampling is useful in data science because
selecting a sample requires less time than collecting data on
every item in a population. Using a sample saves money and
resources and analyzing a sample is more practical
than analyzing an entire population. This is especially important
in modern data analytics where you often deal with
extremely large datasets. For example, let’s say you want to know
the percentage of people in a large city who use a laptop computer. One way to do this is to survey
every resident in the city. First, it would be very difficult
to access contact information for every resident of the city. Second, giving a survey to every resident
of the city would be very expensive, complicated and time consuming. Another way is to find a much
smaller subset of residents and give them a survey. This subset is your sample, then you can
use the sample data you collect about laptop use to draw conclusions about
the laptop use of the entire population. Collecting a sample is faster,
more practical and less expensive than collecting data
on every member of the population. Keep in mind that your sample should
be representative of your population. Recall that a representative sample
accurately reflects the characteristics of a population. The inferences and predictions you make about your
population are based on your sample data. If your sample doesn’t accurately
reflect your population, then your inferences will not be reliable
and your predictions will not be accurate. And this can lead to negative outcomes for
stakeholders and organizations. For instance, let’s say you only
contact computer scientists for your laptop survey. Your sample will not accurately
reflect the overall population. Computer scientists are much more likely
to use a laptop computer than the typical city resident. Many residents may not have
access to any kind of computer or even know how to use one. A sample that only includes computer
scientists is not representative. A representative sample would include
people with different levels of computer knowledge and access. Let’s consider another scenario. Imagine you want to find out
the average height of every adult in the United States, that’s a lot of people. It would take an incredible
amount of time, energy and money to even attempt to measure
every person in the country. Instead you can take
a sample of 100 people and use that sample data to draw conclusions
about the entire population. Now, let’s say you have sample data only
from professional basketball players. Pro basketball players are really tall,
some are over seven feet tall. On average, they’re much taller than
almost everybody else in the population. Their average height does not accurately
reflect the average height of the overall population. A sample that includes only pro basketball
players is not representative of every adult in the US. As a data professional,
I work with sample data every day. I can tell you that having a
representative sample is super important. A wise teammate of mine once said that a
good model can’t overcome a bad sample and the right. Data professionals work with powerful
statistical tools that can model complex datasets and
help generate valuable insights. But if the sample data you’re working
with does not accurately reflect your population, that is, if your
sample is not representative, then it ultimately doesn’t
matter how good your model is. If your predictive model
is based on a bad sample, then your predictions
will not be accurate. Ultimately, the quality of
your sample helps determine the quality of the insights
you share with stakeholders. To make reliable inferences about all your
customers based on feedback from a sample of customers, make sure your sample
is representative of the population.

Reading: The relationship between sample and population


Video: The sampling process

Why Sampling Matters

  • Data professionals constantly work with samples of larger populations.
  • Understanding how sampling works is crucial to assess sample quality, biases, and how well it represents the population.
  • Example: A biased sample (like only polling basketball players on average height) leads to bad conclusions.

5 Steps of the Sampling Process

  1. Identify Target Population: Define the exact group you want to study (e.g., legal adult residents of Vancouver).
  2. Select Sampling Frame: Create a practical list of the population you can access (e.g., a voter registry, even if slightly imperfect).
  3. Choose Sampling Method:
    • Probability sampling (random selection) is ideal for representativeness.
    • Non-probability (convenience, researcher bias) is less reliable.
  4. Determine Sample Size: Larger samples generally increase accuracy, but there are tradeoffs with cost and time.
  5. Collect Sample Data: Carry out your survey, experiment, or data gathering on your selected sample.

Example: Subway Opinion Poll

The video uses the example of polling support for a new subway system to illustrate how each step affects the final results.
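To make the five steps concrete, here is a minimal Python sketch of the subway poll. Only the 100,000-voter population comes from the example; the 55% support rate and the 1,000-person sample size are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Steps 1-2: a hypothetical sampling frame of 100,000 eligible voters.
# Assume (unknown to the pollster) that 55% support the subway project;
# this rate is made up purely for illustration.
frame = rng.random(100_000) < 0.55

# Steps 3-5: draw a simple random sample of 1,000 residents and
# estimate support for the project from the sample alone.
sample = rng.choice(frame, size=1_000, replace=False)
print(f"Estimated support: {sample.mean():.1%}")  # lands near the true 55%
```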

Key Takeaways

  • Careful sampling makes your sample data reflect the population you’re studying.
  • This leads to more reliable conclusions and better decision-making.

When working with sample data, what is the first step in the sampling process?

Identify the target population

The first step in the sampling process is to identify the target population. The sampling process helps determine whether a sample is representative of the population and if it is unbiased.

As a data professional,
you’ll work with sample data all the time. Often this will be sample data previously
collected by other researchers. Sometimes your team may
collect their own data. Either way, it’s important to know
how the sampling process works because it directly affects
the quality of your sample data. The sampling process helps determine
whether your sample is representative of the population and
whether your sample is unbiased. If you estimate the mean height of a
country’s total adult population based on a sample of professional
basketball players, your estimate will not be
accurate. In this video, we’ll go over the main stages
of the typical sampling process. This will give you a useful framework for
understanding how sampling is conducted and how the sampling process
can affect your sample data. To get a clear overview of the sampling
process, let’s divide it into five steps. Step one, identify the target population. Step two, select the sampling frame. Step three,
choose the sampling method. Step four, determine the sample size, and step five,
collect the sample data. As an example, let’s consider a public opinion poll. Imagine the city government of Vancouver,
Canada wants to build a new subway system. The public will vote on whether or
not to move forward with the project. The city government wants to find out if
there’s public support for the project. They ask you to take a poll and estimate the percentage of adult
residents that support the project. Legal adults are 18 years or older. The first stage in the sampling process
is to identify your target population. The target population is the complete set
of elements that you’re interested in knowing more about. In this case, the target population
includes every resident in the city who is 18 years or older and eligible to vote. Let’s say that the city contains 100,000
such residents. Since it’s too difficult and too expensive to survey
everyone in the target population, you decide to take a sample. The next step in the sampling process
is to create a sampling frame. A sampling frame is a list of all
the items in your target population. Basically, it’s a complete list of
everyone or everything you’re interested in researching. The difference
between a target population and a sampling frame is that the population
is general and the frame is specific. So if your target population is 100,000
city residents who are 18 years or older and eligible to vote, your sampling
frame could be a list of names for all these residents–from
Alana Aoki to Zoe Zappa. For practical reasons, your sampling frame may
not accurately match your target population because you may not have
access to every member of the population. For instance, the city may not have
reliable contact information about each resident, or perhaps not all eligible
voters are actually registered to vote, so their opinions about the potential
subway system aren’t relevant, since the project will be decided by
an election. For reasons like these, your sampling frame will not exactly
overlap with your target population. Your sampling frame will include
the list of residents 18 or over that you’re able to obtain
useful information about. So the sampling frame is the accessible
part of your target population. Next, you need to choose a sampling method,
which is step three of the sampling process. One way to help ensure that your sample
is representative is to choose the right sampling method. There are two main types of sampling
methods: probability sampling and non-probability sampling. In later videos, we’ll explore the specifics
in more detail. For now, just know that probability sampling uses
random selection to generate a sample. Non-probability sampling is
often based on convenience or the personal preferences of the researcher
rather than random selection. Because probability sampling methods are
based on random selection, every person in the population has an equal chance
of being included in the sample. This gives you the best chance to get
a representative sample, as your results are more likely to accurately
reflect the overall population. So, assuming you have the budget and the
time, you can use a probability sampling method for
your poll about the subway project. Using random selection gives you the best
chance of getting a sample that’s representative of your population. Step four of the sampling process is to
determine the best size for your sample, since you don’t have the resources to
poll everyone in your sampling frame. In statistics, sample size refers to the
number of individuals or items chosen for a study or experiment. Sample size helps determine
the accuracy of the predictions you make about the population. In general, the larger the sample size, the more accurate your predictions. Based
on the desired level of accuracy for your survey, you can decide how many
eligible voters to include in your sample. Now, you’re ready to
collect your sample data. This is the final step
in the sampling process. To poll the residents selected for
your sample, you decide to conduct a survey.
Based on the survey responses, you determine the percentage
of eligible voters 18 and over who favor the proposed
subway project. Then, you share this information with city
leaders to help them make a more informed decision. Effective sampling ensures that
your sample data is representative of your target population. Then, when you use sample data to make
inferences about the population, you can be reasonably confident
that your inferences are reliable. Your poll will give city leaders
a better idea of public support for the new subway and help inform
future decisions about the project. The decisions you make at each step of the
sampling process can affect the quality of your sample data. Understanding the sampling process will
make you a better data professional, whether you’re analyzing data
collected by other researchers or conducting a survey on your own.

Reading: The stages of the sampling process


Video: Compare sampling methods

Types of Probability Sampling

  • Simple Random Sampling:
    • Every member of the population has an equal chance of selection.
    • Pro: Unbiased, representative results.
    • Con: Can be expensive/time-consuming for large populations.
  • Stratified Random Sampling:
    • Population is divided into strata (groups) and members are randomly selected from each group.
    • Pro: Ensures representation from all relevant groups.
    • Con: Requires knowledge of the population to pick effective strata.
  • Cluster Random Sampling:
    • Population is divided into clusters, and entire clusters are randomly chosen for the sample.
    • Pro: Useful for large, diverse populations with clear subgroups.
    • Con: Clusters may not perfectly reflect the overall population.
  • Systematic Random Sampling:
    • Population is ordered, and members are selected at regular intervals from a random starting point.
    • Pro: Quick and convenient with a full member list.
    • Con: Requires knowing the total population size beforehand.

Key Point: All probability sampling methods rely on random selection to reduce bias and get more accurate results compared to non-probability methods.
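As a rough sketch of how the four methods might look in Python (the 1,000-employee population, the office names, and the group sizes below are all hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: 1,000 employees, each assigned to an office.
df = pd.DataFrame({
    "employee_id": range(1, 1001),
    "office": rng.choice(["Oslo", "Lagos", "Lima", "Seoul"], size=1000),
})

# Simple random sample: every employee has an equal chance of selection.
simple = df.sample(n=100, random_state=0)

# Stratified random sample: randomly select 25 employees from each stratum.
stratified = df.groupby("office", group_keys=False).apply(
    lambda stratum: stratum.sample(n=25, random_state=0)
)

# Cluster random sample: randomly pick 2 offices, keep ALL of their employees.
chosen = rng.choice(df["office"].unique(), size=2, replace=False)
cluster = df[df["office"].isin(chosen)]

# Systematic random sample: order the list, pick a random starting point,
# then take every 10th employee.
start = int(rng.integers(0, 10))
systematic = df.sort_values("employee_id").iloc[start::10]
```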

Fill in the blank: Probability sampling uses ____ selection to generate a sample.

Random

Probability sampling uses random selection to generate a sample. There are four methods: simple, stratified, cluster, and systematic. All are based on random selection, which is the preferred method of sampling for most data professionals.

In a previous video, you learned the
differences between probability and
non-probability sampling. Then you conducted a survey
using probability sampling, which is the third step
of the sampling process. In this video, you
will learn more about the different methods
of probability sampling. Then we’ll discuss the benefits and drawbacks of each method. There are four different
probability sampling methods: simple random sampling, stratified random sampling,
cluster random sampling, and systematic random sampling. In a simple random sample, every member of a
population is selected randomly and has an equal
chance of being chosen. You can randomly
select members using a random number generator or by another method of
random selection. For example, say
you want to survey the employees of a company
about their work experience. The company employs
1,000 people. You can assign each employee in the company database a
number from 1 to 1,000, and then use a random
number generator to select 100 people
for your sample. The main benefit of simple random samples is
that they’re usually fairly representative since
every member of the population has an equal
chance of being chosen. Random samples
tend to avoid bias and surveys like these give
you more accurate results. However, in practice, it’s often expensive
and time-consuming to conduct large, simple
random samples. If your sample size
is not large enough, a specific group of people in the population may be
underrepresented in your sample. If you use a larger sample size, your sample will more accurately
reflect the population. The next method of
probability sampling is a stratified random sample. In a stratified random sample, you divide a population
into groups and randomly select some members from each group to
be in the sample. These groups are called strata. Strata can be organized
by age, gender, income, or whatever category you’re interested in studying. For example, say you want to
survey high school students about how much time they
spend studying on weekends. You might divide the
student population according to age: 14, 15, 16, and 17-year-olds. Then you can survey
an equal number of students from each age group. Stratified random samples
help ensure that members from each group
in the population are included in the survey. This method allows you to draw more accurate conclusions
about the relevant groups. For instance, 14-year-olds
and 17-year-olds may have different perspectives about studying on the weekends. Older students can
drive and may have more social activities
or work on the weekends. Stratified sampling will
capture both perspectives. One main disadvantage of stratified sampling is that
it can be difficult to identify appropriate strata for a study if you lack
knowledge of a population. For example, if
you want to study median income among
a population, you may want to stratify
your sample by job type, or industry, or location,
or education level. If you don’t know how relevant these categories are
to median income, it will be difficult to choose the best one for your study. The next method of
probability sampling is a cluster random sample. When you’re conducting a
cluster random sample, you divide a population
into clusters, randomly select
certain clusters, and include all members from the chosen clusters
in the sample. Cluster sampling is similar to stratified random sampling, but in stratified sampling, you randomly choose some members from each group to
be in the sample. In cluster sampling, you choose all members from a group
to be in the sample. Clusters are divided using identifying details,
such as age, gender, location, or
whatever you want to study. For example, imagine you
want to conduct a survey of employees at a global company
using cluster sampling. The company has 10 offices in different cities
around the world. Each office has about
the same number of employees and similar job roles. You randomly select
three offices in three different
cities as clusters. You include all the employees at the three offices
in your sample. One advantage of this method
is that a cluster sample gets every member from
a particular cluster, which is useful
when each cluster reflects the
population as a whole. This method is helpful
when dealing with large and diverse
populations that have clearly defined subgroups. If researchers want
to learn about the preferences of primary
school students in Oslo, Norway, they can
use one school as a representative sample of
all schools in the city. A main disadvantage of cluster sampling is that
it may be difficult to create clusters
that accurately reflect the overall population. For example, for
practical reasons, you may only have access
to the offices in the United States
when your company has locations all
over the world. Employees in the
United States may have different characteristics
and values than employees in
other countries. The final method of
probability sampling is a systematic random sample. In systematic random samples, you put every member of a population into an
ordered sequence. Then you choose a
random starting point in the sequence and select members for your
sample at regular intervals. Let’s assume you want to survey students at a community college. For a systematic random sample, you’d put the students’
names in alphabetical order, randomly choose a
starting point, and pick every fifth name
to be in the sample. Systematic random samples are often representative
of the population since every member has an equal chance of being
included in the sample. Whether the student’s
last name starts with B or R isn’t going to affect
their characteristics. Systematic sampling
is also quick and convenient when you have a complete list of the
members of your population. One disadvantage of systematic sampling
is that you need to know the size of the population that you want to study
before you begin. If you don’t have
this information, it’s difficult to choose
consistent intervals. The four methods of probability
sampling we’ve covered, simple, stratified, cluster, and systematic, are all
based on random selection, which is the preferred method of sampling for most
data professionals. These methods can
help you create a sample that is representative
of the population. In an upcoming video, we’ll check out some methods of non-probability sampling and why they’re not considered
representative.

Reading: Probability sampling methods

Video: The impact of bias in sampling

The Importance of Representative Samples

  • Machine learning models trained on biased data are likely to make unfair and inaccurate decisions.
  • Representative samples, where everyone in the population has an equal chance of being included, help reduce bias and create fairer models.

Non-Probability Sampling

  • While cheaper and more convenient than probability sampling, non-probability methods are prone to bias.
  • These methods are useful for early exploration but shouldn’t be relied on for making conclusions about the entire population.

Four Main Types of Non-Probability Sampling

  1. Convenience Sampling: Choosing participants who are easy to reach.
    • Prone to undercoverage bias (certain groups are underrepresented)
    • Example: Polling people at a school, missing those who don’t attend.
  2. Voluntary Response Sampling: People volunteer to participate.
    • Prone to nonresponse bias (those with strong opinions are more likely to respond).
    • Example: A restaurant’s online survey will likely get extreme positive or negative views.
  3. Snowball Sampling: Participants recruit others to join.
    • Participants tend to be too similar, making the sample unrepresentative.
    • Example: Study on cheating, where the initial few students recruit friends who might also have cheated.
  4. Purposive Sampling: Picking participants based on specific criteria.
    • Intentionally excludes groups, making the sample focused but potentially biased.
    • Example: Surveying only high-GPA students about teaching methods misses insights from those who struggle.

Key Takeaway for Data Professionals

  • Be constantly aware of bias, from data collection to presenting results.
  • Understanding common types of bias helps you spot them in your work and take steps to minimize them.

Sampling bias occurs when a sample is not representative of the population as a whole.

True

Sampling bias occurs when a sample is not representative of the population as a whole. Models based on representative samples are much more likely to lead to fair and unbiased decisions.

In my work as a data
professional I often use sample data to help build
machine learning models. Today, a machine-learning
model may help determine if a person
gets an approval for a loan, an interview for a job or an
accurate medical diagnosis. Models based on
representative samples are much more likely to make fair and unbiased decisions about who gets a loan
or a job interview. Using samples that
are representative of the different
types of people in the population helps ensure that each person receives the
treatment that is best for them. Unfortunately, bias can
affect sample data. Sampling bias occurs
when a sample is not representative of the
population as a whole. To eliminate bias, I try
to use samples that are representative of the
overall population. The consequences of
drawing conclusions from a non-representative
sample can be serious. Recently you learned that
probability sampling methods use random selection, which helps avoid sampling bias. A randomly chosen sample
means that all members of the population have an equal
chance of being included. In contrast, non-probability
sampling methods do not use random selection, so they do not typically
generate representative samples. In fact, they often
result in biased samples. However, non-probability
sampling is often less expensive and more convenient
for researchers to conduct. Sometimes due to budget, time or other reasons, it’s just not possible to
use probability sampling. Plus non-probability
methods can be useful for exploratory studies
which seek to develop an initial
understanding of a population, not draw conclusions or make predictions about the
population as a whole. In this video, we’ll
discuss four methods of non-probability
sampling and learn how sampling bias can
affect each method. These four methods are
convenience sampling, voluntary response sampling, snowball sampling, and
purposive sampling. Let’s start with
convenience sampling. In this method, you
choose members of a population that are
easy to contact or reach. As the name suggests, conducting a convenience
sample involves collecting a sample from
somewhere convenient to you, such as your workplace, a local school,
or a public park. For example, to conduct
an opinion poll, a researcher might stand in
front of a local high school during the day and poll people
that happen to walk by. Because these samples are
based on convenience to the researcher and not a broader sample of
the population, convenience samples often
show undercoverage bias. Undercoverage bias occurs
when some members of a population are inadequately
represented in the sample. For instance, people
who don’t work at or attend the school will not be represented as much
in this sample. The next method of
non-probability sampling is voluntary response sampling. This type of sample
consists of members of a population who volunteer
to participate in a study. For example, the
owners of a restaurant want to know how people feel
about their dinner options. They ask their regular
customers to take an online survey about the quality of the
restaurant’s food. Voluntary response
samples tend to suffer from nonresponse bias, which occurs when certain
groups of people are less likely to
provide responses. People who voluntarily respond will likely have
stronger opinions, either positive or negative, than the rest of the population. This makes the
volunteer customers at the restaurant an
unrepresentative sample. The next non-probability
sampling method is snowball sampling. In a snowball sample researchers recruit initial
participants to be in a study and then
ask them to recruit other people to
participate in the study. Like a snowball, the
sample size gets bigger and bigger as more
participants join in. For example, if a study was investigating cheating
among college students, potential participants might
not want to come forward. But if a researcher can find a couple of students
willing to participate, these two students
may know others who have also cheated on exams. The initial participants could then recruit others by sharing the benefits of the study and reassuring them of
confidentiality. Although it may seem
convenient that study participants
help build the sample, this type of recruiting
can lead to sampling bias. Because initial
participants recruit additional participants
on their own, it’s likely that most of them will share similar
characteristics. These characteristics might be unrepresentative of the total
population under study. In purposive
sampling, researchers select participants based on
the purpose of their study. Because participants
are selected for the sample according to
the needs of the study, applicants who do not fit
the profiles are rejected. For example, a researcher
wants to survey students on the effectiveness of certain teaching methods
at the university. The researcher only wants to include students who regularly attend class and have an established record of
academic achievement. So they select the students with the highest grade point averages to participate in the study. In purposive sampling,
the researcher often
intentionally
excludes certain groups
relevant to their study. In this case, the
researcher excludes students who don’t have
high grade point averages. This could lead to
biased outcomes because the students
in the sample are not likely to be representative of the overall
student population. As a data professional, you have to think about bias and fairness from the
moment you start collecting sample data to the time you present
your conclusions. Once you become aware of
some common forms of bias, you can remain on the alert
for bias in any form.

Reading: Non-probability sampling methods


Practice Quiz: Test your knowledge: Introduction to sampling

A data professional is conducting an election poll. As a first step in the sampling process, they identify the target population. What is the second step in the sampling process?

Fill in the blank: In a _____ sample, every member of a population is selected randomly and has an equal chance of being chosen.

Non-probability sampling includes which of the following sampling methods? Select all that apply.

Sampling distributions

Video: How sampling affects your data

What is a Sampling Distribution?

  • A sampling distribution is a probability distribution that represents the possible values of a sample statistic (like the sample mean).
  • It’s created by taking repeated random samples of the same size from a population.
  • The sampling distribution of the mean shows how the means of different samples from the same population would be distributed.

Key Points about the Sampling Distribution of the Mean

  • Central Limit Theorem: As the sample size increases, the sampling distribution of the mean approaches a normal distribution, even if the population distribution is not normal.
  • Mean: The mean of the sampling distribution of the mean is equal to the population mean.
  • Variability: The variability in the sampling distribution is measured by the standard error of the mean. The standard error gets smaller as the sample size increases.
  • Estimation: The sample mean can be used as a point estimate to approximate the population mean.

Standard Error

  • The standard error of the mean represents the average amount that sample means deviate from the true population mean.
  • A smaller standard error means a sample mean is more likely to be a good estimate of the population mean.
  • Formula: Standard Error = Sample Standard Deviation / Square Root of Sample Size (S / √n)
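A quick sketch of that formula in code, using the penguin figures from the video (a sample of 100 with a standard deviation of about 1 pound):

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the 100 weighed penguins:
# mean weight ~3 pounds, standard deviation ~1 pound.
sample = np.random.default_rng(1).normal(loc=3.0, scale=1.0, size=100)

# Standard error = sample standard deviation / sqrt(sample size).
se = sample.std(ddof=1) / np.sqrt(len(sample))
print(round(se, 3))             # about 0.1

# scipy.stats.sem computes the same quantity directly.
print(round(stats.sem(sample), 3))
```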

Key Takeaway

Sampling distributions and the standard error are critical concepts in statistics. They help data professionals assess the accuracy and reliability of estimates derived from sample data.
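Here is a small simulation in the spirit of the penguin example from the video; the population values are invented, with only the roughly 3-pound mean taken from the transcript:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population of 10,000 penguin weights, centered near 3 pounds.
population = rng.normal(loc=3.0, scale=0.5, size=10_000)

# Take 1,000 repeated simple random samples of 10 penguins each and record
# each sample's mean; together, the means form a sampling distribution.
sample_means = [
    rng.choice(population, size=10, replace=False).mean() for _ in range(1_000)
]

print(f"mean of sample means: {np.mean(sample_means):.2f}")  # ~ population mean
print(f"standard error:       {np.std(sample_means):.2f}")   # ~ 0.5 / sqrt(10)
```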

What term describes a probability distribution of a sample statistic?

Sampling distribution

A sampling distribution describes the probability distribution of a sample statistic. A probability distribution represents the possible outcomes of a random variable; a sampling distribution represents the possible outcomes for a sample statistic.

In previous videos, you learned
how the sampling process works and the benefits and drawbacks of various
sampling methods. As a data professional, I often work with sample data to make informed predictions about future sales revenue
or product performance. Understanding how
sampling affects your data, both positively
and negatively, will be important in
your future career in data analytics. For instance, one way
data professionals use sample statistics is to
estimate population parameters. As you may recall, a statistic
is a characteristic of a sample and a parameter is a characteristic
of a population. For example, the mean weight of a random sample of 100
penguins is a statistic. The mean weight of the
total population of 10,000 penguins is a parameter. A data professional might use the mean weight of the sample of 100 penguins to estimate the mean weight of
the population. This type of estimate is
called a point estimate. A point estimate uses a single value to estimate
a population parameter. In this video, we’ll discuss
the concept of sampling distribution and
how it can help you represent the possible
outcomes of a random sample. You will also learn how the
sampling distribution of the sample mean
can help you make a point estimate of
the population mean. A sampling distribution is a probability distribution
of a sample statistic. Recall that a
probability distribution represents the possible
outcomes of a random variable, such as a coin toss or die roll. In the same way, a sampling
distribution represents the possible outcomes for a sample statistic,
like the mean. Imagine you take repeated
simple random samples of the same size
from a population. Since each sample is random, the mean value will
vary from sample to sample in a way that cannot
be predicted with certainty. Right now, this may
seem a bit abstract. To get a better idea of the sampling distribution
of the mean, let’s continue with
our penguin example. Imagine you’re
studying a population of 10,000 blue penguins, which are the smallest of
all known penguin species. You want to find out
the mean weight of a blue penguin in
this population. Since it would take
too long to locate and weigh every single penguin, you instead collect sample
data from the population. Let’s say you take
repeated simple random samples of 10 penguins, each from the population. In other words, you randomly choose 10 penguins
from the group, weigh them, and then repeat this process with a different
set of 10 penguins. For your first sample, you find the mean weight of the 10 penguins is 3.1 pounds. For your second sample, the mean weight of the 10
penguins is 2.9 pounds. For your third sample, the mean weight is 2.8
pounds, and so on. Imagine that the
true mean weight of a penguin in this
population is three pounds. Although in practice,
you wouldn’t know this unless you weighed
every single penguin. Each time you take a
sample of 10 penguins, it’s likely that the mean
weight of the penguins in your sample will be close to the population mean
of three pounds, but not exactly 3 pounds. Every once in a while, you
may get a sample full of smaller than average
penguins with a mean weight of
2.5 pounds or less. Or you might get a sample
full of larger than average penguins
with a mean weight of 3.5 pounds or more. The mean weight will vary
randomly from sample to sample. Sampling variability
refers to how much an estimate varies
between samples. You can use a sampling
distribution to represent the frequency of all your
different sample means. I find that it helps to visualize these samples
as a histogram. Let’s plot 10 simple random
samples of 10 penguins each. The most frequently
occurring sample means will be around three pounds. The least frequent
sample means will be the more extreme ways such
as 2.3 or 3.7 pounds. As you increase the
size of a sample, the mean weight of your
sample data will get closer to the mean weight
of the population. If you sample the entire
population, in other words, if you actually weighed
all 10,000 penguins, your sample mean would be the same as your population mean. But to get an accurate estimate
of the population mean, you don’t have to
weigh 10,000 penguins. If you take a large enough
sample size from a population, say 100 penguins, your sample mean will be an accurate estimate of
the population mean. This point is based on the
central limit theorem, which we’ll explore in more detail later
on in the course. For now, just know that if
your sample is large enough, your sample mean will roughly
equal the population mean. For instance, imagine you collect a sample of 100 penguins and find that the mean weight
of your sample is 3 pounds. This means that your best
estimate for the mean weight of the entire penguin
population is also 3 pounds. You can also use
your sampling distribution to estimate how accurately the mean weight of
any given sample represents the
population mean weight. This is useful to know because the mean varies from
sample to sample, and any given sample is not necessarily an exact reflection
of the population mean. For example, the
true mean weight for the penguin population
might be three pounds. The mean weight for
any given sample of penguins might be 3.3 pounds, 2.8 pounds, 2.4
pounds, and so on. The more variability
in your sample data, the less likely it is
that the sample mean is an accurate estimate
of the population mean. Data professionals use
the standard deviation of the sample means to
measure this variability. Recall that the standard deviation measures
the variability of your data or how spread
out your data values are. The more spread between
the data values, the larger the
standard deviation. In statistics, the
standard deviation of a sample statistic is
called the standard error. The standard error
of the mean measures variability among all
your sample means. A larger standard
error indicates that the sample means
are more spread out, whether there’s
more variability. A smaller standard
error indicates that the sample means
are closer together, or that there’s
less variability. The smaller the standard error, the more likely it is
that your sample mean is an accurate estimate
of the population mean. For example, say you take three random samples
of 10 penguins each, the mean weight of the
first sample is 3.3 pounds. The second is 3.1 pounds, and the third is 2.9 pounds. There’s not much variability among these three sample means. The values are all
close together. The standard error will
be relatively small. Now, say you take another three random samples
of 10 penguins each. The mean weight of the
first sample is 2.2 pounds, the second is 3.2 pounds, and the third is 4.2 pounds. There’s more variability among
these three sample means. The values are more spread out. The standard error will
be relatively large. Note that the concept
of standard error is based on the practice
of repeated sampling. In reality, researchers usually work with a single sample. It’s often too
complicated, expensive, or time-consuming to take repeated samples
of a population. Instead, statisticians
have derived a formula for calculating the standard error based on the mathematical assumption
of repeated sampling. You can use the
following formula to calculate the standard
error of the sample mean. S divided by the
square root of n, where S is the sample
standard deviation and n is the sample size. For example, in your
study of penguin weights, imagine that a sample of 100
penguins has a mean weight of three pounds and a standard
deviation of one pound. You can calculate the
standard error by dividing the sample standard deviation,
1, by the square root of the sample size, 100. One divided by the square root of
100 equals 0.1. This means that your
best estimate for the true population mean weight of all penguins is 3 pounds, but you should expect that the mean weight from
one sample to the next will vary with a standard
deviation of about 0.1 pounds. As your sample size gets larger, your standard error
gets smaller. This is because standard
error measures the difference between your sample mean and
the actual population mean. As your sample gets larger, your sample mean gets closer to the actual population mean. The more accurate the estimate
of the population mean, the smaller the standard error. Say you collect a
sample of 10,000 penguins instead
of 100 penguins. You find that the
sample mean weight is 3 pounds and the sample
standard deviation is 1 pound. The standard error is one
divided by the square root of 10,000, which equals 0.01. Your best estimate
for the sample mean will still be 3 pounds, but now you can expect
that the mean weight from one sample of
penguins to the next will vary with a standard
deviation of just 0.01 pounds. In general, you can have more confidence in
your estimates as the sample size gets larger and the standard
error gets smaller. This is because the mean of your sampling distribution gets closer to the population mean. Coming up, we’ll
explore this idea further when we talk about
the central limit theorem.

Video: The central limit theorem

What is the Central Limit Theorem (CLT)?

  • Key Idea: As the sample size increases, the distribution of sample means will approach a normal distribution (bell curve), regardless of the shape of the original population distribution.
  • Practical Benefit: If your sample is large enough, the sample mean will be a close approximation of the true population mean.

Why the CLT Matters

  • No Need to Know Population Distribution: You can estimate population parameters (like the mean) without needing detailed knowledge of the entire population’s distribution.
  • Sample Size Guidelines: While there’s no single rule, samples of 30 or more are generally considered sufficient for the CLT to apply.
  • Real-World Applications: The CLT is used in economics, science, business, and more to understand things like average income, animal populations, and commute times.

Examples

  • Household Income: Even if the distribution of income is skewed, a large enough sample will give a sampling distribution that’s normal and provides a good estimate of average income.
  • Coffee Consumption: You can estimate the average coffee consumption across a large population by studying representative samples and applying the CLT.

Key Takeaway

The Central Limit Theorem is a powerful tool that helps data professionals make inferences about populations based on sample data.
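One way to see the theorem at work is to sample from a deliberately skewed population, echoing the household income example. The exponential shape, scale, and sample sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# A right-skewed "income" population, far from a bell curve.
population = rng.exponential(scale=60_000, size=100_000)

# For several sample sizes, build a sampling distribution of the mean.
for n in (5, 30, 100):
    means = [rng.choice(population, size=n).mean() for _ in range(2_000)]
    print(f"n={n:>3}: mean of means = {np.mean(means):>9,.0f}, "
          f"spread = {np.std(means):>8,.0f}")

# As n grows, the sample means cluster around the true population mean and
# their histogram looks increasingly like a normal distribution.
```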

Fill in the blank: The central limit theorem states that the sampling distribution of the mean approaches a _____ distribution as the sample size increases

Normal

The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. In other words, as the sample size increases, the sampling distribution assumes the shape of a bell curve. If a large enough sample of the population is used, the sample mean will be roughly equal to the population mean.

Recently, we talked briefly about the central limit theorem and the relationship between sample size and the sample mean. Data professionals use the
central limit theorem to estimate population parameters
for data in economics, science, business,
and other fields. For example, they may use
the theorem to estimate the mean annual household income for an entire city or country, the mean height and weight for an entire animal or
plant population, or the mean commute time for all the employees of
large corporation. In this video, you’ll learn more about the central
limit theorem and how it can help you estimate the population mean for
different types of data. The central limit theorem states that the sampling
distribution of the mean approaches a
normal distribution as the sample size increases. In other words, as
your sample increases, your sampling
distribution assumes the shape of a bell curve. If you take a large enough
sample of the population, the sample mean will be roughly equal to the
population mean. For example, say you
want to estimate the average height for a university student
in South Africa. Instead of measuring
millions of students, you can get data on a
representative sample of students. If your sample size
is large enough, the mean height of
your sample will be roughly equal to the mean
height of the population. There is no exact rule for how large a sample
size needs to be in order for the central
limit theorem to apply. In general, a sample size of 30 or more is
considered sufficient. Exploratory data analysis
can help you determine how large of a sample is
necessary for a given dataset. What’s really powerful about the central limit
theorem is that it holds true for
any population. You don’t need to
know the shape of your population distribution in advance to apply the theorem. If you collect a
large enough sample, the shape of your
sampling distribution will follow a normal
distribution. This pattern is true even if your population has a
skewed distribution. For example, here is a graph for annual household income in
the US for the year 2010. The x-axis represents
annual income and the y-axis represents
the percent of households that
have that income. You may notice how
the data is skewed to the right and the shape of the distribution
is far from normal. The distribution for annual
income is skewed because of the extraordinarily high incomes of the wealthiest households. However, if you sample
incomes at random among all households and
take a large enough sample, your sampling distribution will follow a normal distribution. This is true even though the
population distribution, every US household
is not normal. The mean income of your sampling distribution will give you an accurate estimate of the mean income of the
entire population. Let’s check out another example. Imagine you’re studying
the population of coffee drinkers in
the United States. You want to know the
average amount of coffee each person
drinks per day, but you don’t have
the time or money to survey every single
coffee drinker in the US, which, by the way, is around 150 million people. Instead of surveying
the entire population, you collect repeated
random samples of 100 coffee drinkers. Using this data, you calculate the mean amount of
coffee consumed per day for your first
sample, 22.5 ounces. For your second sample, the mean amount is 28.2 ounces. You take a third sample, the mean amount is
25.4 ounces and so on. In theory, you could take 10, 50 or 100 samples and keep
increasing the sample size until you’ve surveyed
all 150 million people about their coffee consumption. The central limit theorem says that as your
sample size increases, the shape of your sampling
distribution will increasingly resemble
a bell curve. In practice, this
specific sample size you choose will depend
on factors like budget, time, resources, and the desire level of
confidence for your estimate. If you take a large enough
sample from the population, the mean of your
sampling distribution will equal the population mean. From this sample
of the population, you can accurately estimate
the average amount of coffee consumed per day for
the entire population. In case you’re wondering, the average American
drinks around 24 ounces or three cups
of coffee per day. Based on what I’ve noticed, if we took a sample of
only data professionals, the mean value might
be even higher. Whether you’re measuring
coffee consumption or household income, the central limit theorem
is a useful method for better understanding the
distribution of your data.

Reading: Infer population parameters with the central limit theorem



Video: The sampling distribution of the proportion

What is a Population Proportion?

  • It’s the percentage of individuals in a population with a specific characteristic (e.g., percentage of teens preferring slip-on sneakers).

Why Sample?

  • Surveying an entire population is often impractical. We use samples to estimate the true population proportion.

Sampling Distribution of the Proportion

  • Just like with sample means, sample proportions vary between samples.
  • A sampling distribution shows the frequency of different possible sample proportions.
  • As sample size increases, the distribution becomes approximately normal (due to the Central Limit Theorem).

Estimating the Population Proportion

  • If your sample is large enough, the sample proportion is a good estimate of the true population proportion.

Standard Error of the Proportion

  • Measures how much a sample proportion is likely to differ from the true population proportion.
  • Formula: √(p-hat * (1 – p-hat) / n) where p-hat is the sample proportion and n is the sample size.
  • Larger sample size = smaller standard error = more accurate estimate.

Key Takeaways

  • Sampling distributions help us understand the variability of sample proportions and how they relate to the true population proportion.
  • Data professionals use this knowledge to provide robust estimates and accurate information for decision-making.

In this part of the course, we’ve been
talking about how data professionals use sample statistics to
estimate population parameters. Recently, you learned how to use
the sampling distribution of the mean to estimate the actual population mean. For example, you might estimate the mean
weight of an animal population or the mean salary of all the people who
work in the hospitality industry. Data professionals also use sampling
distributions to estimate population proportions. In statistics, a population proportion
refers to the percentage of individuals or elements in a population that
share a certain characteristic. Proportions measure percentages or
parts of a whole. For example, you might survey 100
employees at a large company to estimate what percentage of all employees like
the food at the office cafeteria. Data professionals might also use the
sampling distribution of the proportion to estimate the proportion of all visitors
to a website who make a purchase before leaving, assembly-line products that meet
quality control standards, or voters who support a candidate
in an upcoming election. In this video, you’ll learn about
the sampling distribution of the sample proportion and how it can help you
estimate the population proportion. Imagine you work for
a market research firm, your client is a company that
manufactures sneakers and wants to make sure their sneakers
appeal to the largest audience. You’re asked to research sneaker
preferences among residents of Santiago, Chile, who are between 16 and
19 years old. There are 100,000 teenagers
in that age group. You might want to find out what proportion
of this population prefers slip-on sneakers over sneakers with shoelaces. Since it would take too long to locate and
survey all 100,000 teens, you instead collect sample
data from the population. Let’s say you take repeated
simple random samples of 100 teenagers from the overall population. In your first sample, you find that 12%
of teenagers prefer slip-on sneakers. In your second sample, you find that 8% prefer slip-on sneakers. You take a third sample,
and the proportion is 11%. Earlier, we talked about sampling
variability for the sample means or how the value of the mean varies
from one sample to the next. The same holds true for proportions. Let’s assume we know that 10% of teenagers
in the total population prefer slip-on sneakers. In most of the samples, the proportion
of teenagers who prefer slip-on sneakers will be close to the true population
proportion of 10%, but not exactly 10%. Occasionally, a sample may turn out to
have a proportion that’s very small or very large. You can use a sampling distribution to
represent the frequency of all your different sample proportions. For instance, if you take ten simple random samples
of 100 teenagers from this population, you can show the sampling distribution
of the proportion in a histogram. The most frequently occurring values
in your sample data will be around 10%. The values that occur least frequently
will be the more extreme proportions, such as 5% or 15%. As with the sample means, the central limit theorem also
applies to sample proportions. As your sample size increases, the distribution of the sample
proportion will be approximately normal. The overall average, or mean proportion,
is located in the center of the curve. If you take a sufficiently large sample of teenagers, the sample proportion will be an accurate estimate of the true population proportion. If you survey 1,000 teenagers and find that 10% prefer slip-on sneakers,
this means that your best estimate for the proportion of all teenagers
who prefer slip-ons is also 10%. As with the sample mean, you can use
the standard error of the proportion to measure sampling variability. This tells you how much a particular
sample proportion is likely to differ from the true population proportion. This is useful to know because the
proportion varies from sample to sample, and any given sample proportion probably
won’t be exactly equal to the true population proportion. The true proportion of teenagers who
prefer slip-on sneakers might be 10%, but the proportion of any given sample
might be 12%, 9%, 7%, and so on. The more variability in your sample data,
the less likely it is that the sample proportion is an accurate estimate
of the population proportion. It’s important to understand the accuracy
of your estimate because stakeholder decisions are often based on
the estimates you provide. You can use the following formula to calculate the standard error of the proportion: the square root of p-hat, open parenthesis, one minus p-hat, close parenthesis, divided by n. P-hat refers to the sample proportion, and n refers to the sample size. In statistics, you say "hat" when you refer to the caret symbol above the letter p. The formula is the square root of p-hat multiplied by one minus p-hat, divided by n. For example, suppose you survey 100
teenagers about their sneaker preferences and find that your estimate for the population proportion of teens who
prefer slip-on sneakers is 10%, or 0.1. In this case, p-hat is 0.1 and n is 100. When you plug these numbers into the formula, the standard error of the proportion equals 0.03. As your sample size gets larger, your standard error gets smaller, because standard error measures the difference between your sample proportion and the true population proportion. As your sample gets larger, your sample proportion gets closer to the true population proportion. The more accurate the estimate
of the population proportion, the smaller the standard error. Your estimate will help stakeholders at
the Sneaker Company make decisions about product development. Based on your results, they may want to put less money
into developing slip-on sneakers. Typically, the next step for a data
professional would be to use the standard error to construct a confidence interval. This describes the uncertainty
of your estimate and gives your stakeholders more detailed
information about your results. Later on in this course, you’ll learn how
to calculate and interpret confidence intervals to more accurately predict
preferences of a population.
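The video describes this calculation verbally; the following is a minimal Python sketch of the same arithmetic. The helper function and the simulation check are additions for illustration, not code from the course:

```python
import numpy as np

def standard_error_proportion(p_hat: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return np.sqrt(p_hat * (1 - p_hat) / n)

# The video's worked example: p-hat = 0.1 (10% prefer slip-ons), n = 100.
print(standard_error_proportion(0.1, 100))  # 0.03

# Optional check by simulation: draw many samples of 100 teens from a
# population where the true proportion is 10%, and compare the spread of
# the sample proportions to the standard error.
rng = np.random.default_rng(0)
sample_proportions = rng.binomial(n=100, p=0.10, size=10_000) / 100
print(round(sample_proportions.std(), 3))  # close to 0.03
```

The simulated spread of the sample proportions matches the formula, which is exactly what the standard error is meant to capture.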

Reading: The sampling distribution of the mean


Practice Quiz: Test your knowledge: Sampling distributions

A data professional is analyzing data about a population of aspen trees. They take repeated random samples of 10 trees from the population and compute the mean height for each sample. Which of the following statements best describes the sampling distribution of the mean?

The central limit theorem implies which of the following statements? Select all that apply.

What is a standard error?

Work with sampling distributions in Python


Lab: Annotated follow-along guide: Sampling distributions with Python


Video: Sampling distributions with Python

Understanding Population Parameters through Sampling

  • Point Estimates: Data professionals often want to know characteristics (like the average literacy rate) of an entire population. Directly measuring everyone is often impractical. Instead, we take a random sample and calculate the sample mean as an estimate (point estimate) of the true population mean.
  • Sampling Variability: Due to chance, the sample mean will likely differ slightly from the population mean. Different random samples will produce different point estimates.
  • Central Limit Theorem
    • Larger sample sizes lead to more accurate estimates.
    • The distribution of sample means from many random samples approaches a normal distribution centered around the true population mean.

Simulating Sampling with Python

  • Example: The text demonstrates how to use Python code (sample(), mean(), etc.) to simulate taking random samples of districts and calculating sample means for literacy rate.
  • 10,000 Samples: Simulating taking 10,000 random samples shows:
    • The distribution of these sample means is approximately normal.
    • The average of these sample means closely aligns with the true population mean.

Key Takeaways

  • Sampling allows us to estimate population characteristics even if we can’t measure the whole population.
  • Python provides tools to simulate and visualize this process.
  • Understanding the Central Limit Theorem helps us interpret the accuracy and variability of our estimates.

Earlier, we talked about
how data professionals use sample data to make point estimates about
population parameters. For instance, if
you want to know the average age of
registered voters in Japan, you could take a survey
of 100 registered voters. Then you could use
the average age of the survey respondents, as a point estimate of the average age of all
registered voters. If your sample size
is large enough, your sample mean will give you a pretty good estimate
of the population mean. In this video, you’ll use Python to simulate
random sampling. Then based on your sample data, you’ll make a point estimate
of a population mean. We’ll continue with our previous scenario in which you’re a data professional working for the Department of Education
of a large nation. Recall that you’re
analyzing data on the literacy rate
for each district. You’ll continue to use the data set you
worked with before. If you need to access
the data, do so now. For this video, we’ll make a
slight change to our story. Imagine that you are asked
to collect the data on district literacy rates and that you have limited
time to do so. You can only survey 50
randomly chosen districts, instead of 634 districts included in your
original data set. The goal of your research
study is to estimate the mean literacy rate
for all 634 districts, based on your sample
of 50 districts. You can use Python
to simulate taking a random sample of 50
districts from your data set. Now, let’s open up a Jupyter
notebook and get to work. To start, import the Python packages you'll use, numpy, pandas, and statsmodels.api, and the library you'll use, matplotlib.pyplot. To save time, rename each package and library with an abbreviation: np, pd, plt, and sm. To load the scipy stats module, write from scipy import stats.
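Written out as code, those imports might look like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
```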
First, you'll want to get a random sample of 50 districts. A cool feature of Python is that you can use code to simulate random sampling and choose the desired sample size. To do this, use the sample() function in pandas. Before you write the code, let's review the function
function in pandas. Before you write the code, let’s review the function
and its arguments. To simulate random sampling, use the following arguments
in the sample function n, replace, and random_state. n refers to the
desired sample size, replace indicates
whether you are sampling with or
without replacement, and random_state refers to the seed of the random number generator. Let's explore each argument in more detail. First, sample size, or the number of items in your sample. In this case, you want to
take a random sample of 50 district literacy rates from the overall literacy column. Second, replacement. In general, you can sample
with or without replacement. When a population element can be selected more than one time, you are sampling with replacement. When a population element can be selected only one time, you are sampling
without replacement. For example, suppose we
have a jar that contains 100 unique numbers from 1-100. You want to select
a random sample of numbers from the jar. After you pick a
number from the jar, you can put the number aside or you can put it
back in the jar. If you put the number
back in the jar, it may be selected
more than once. This is sampling
with replacement. If you put the number aside, it can be selected
only one time. This is sampling
without replacement. For the purposes of our example, you will sample
with replacement. The final part of the code is random_state, or the seed of the random number generator. A random seed is
a starting point for generating random numbers. You can use any
arbitrary number to fix the random seed and give the random number generator
a starting point. Also, going forward you can use the same random seed to generate
the same set of numbers. In a later video, you’ll
work with the sample again. Now you're ready to write your code. First, name a new variable, sampled_data. Then set the arguments for the sample function: n (the sample size) equals 50, and replace equals True because you're sampling with replacement. For random_state, choose an arbitrary number for your random seed. How about 31,208? Now, display the value of your variable. The output shows 50 districts selected randomly from your data set. Each has a different literacy rate.
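Here is a sketch of that step. It assumes the dataset is loaded in a pandas DataFrame named education_districtwise with literacy rates in a column named 'OVERALL_LI'; both names are assumptions, so substitute whatever names your notebook uses:

```python
sampled_data = education_districtwise['OVERALL_LI'].sample(
    n=50,                # sample size: 50 districts
    replace=True,        # sampling with replacement
    random_state=31208,  # arbitrary seed, fixed for reproducibility
)
sampled_data  # displays the 50 randomly selected literacy rates
```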
Now that you have your random sample, use the mean function to compute the sample mean. First, name a new variable, estimate1. Next, use the mean function to compute the mean for your sample data. The sample mean for district literacy rate is about 74.22 percent.
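In code, continuing the sketch above:

```python
estimate1 = sampled_data.mean()
estimate1  # about 74.22 in the video's example
```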
This is a point estimate of the population mean based on your random sample of 50 districts. Remember that the
population mean is the literacy rate
for all districts. Due to sampling variability, the sample mean is usually not exactly the same as
the population mean. Next, let’s find out what will happen if you
compute the sample mean based on another random
sample of 50 districts. To generate another
random sample, name a new variable, estimate2. Then set the arguments for the sample function. Once again, n is 50 and replace is True. This time, choose a different number for your random_state to generate a different sample. How about 56,810? Finally, add the mean function at the end of your line of code to compute the sample mean. Display the value of your variable. For your second estimate, the sample mean for district literacy rate is about 74.24 percent. Due to sampling variability, this sample mean is different from the sample mean of your previous estimate, 74.22 percent, but they're really close.
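As a sketch, under the same naming assumptions as before:

```python
estimate2 = education_districtwise['OVERALL_LI'].sample(
    n=50, replace=True, random_state=56810  # different seed, different sample
).mean()
estimate2  # about 74.24 in the video's example
```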
Recall that the central limit theorem
sample size is large enough, the sample mean approaches a normal distribution and as you sample more observations
from a population, the sample mean gets closer
to the population mean. The larger your sample size, the more accurate
your estimate of the population mean
is likely to be. In this case, the
population mean is the overall literacy rate for all districts in the nation. In a previous video, you found that the
population mean literacy rate is 73.39 percent. Based on sampling, your
first estimated sample mean was 74.22 percent, and your second estimate
was 74.24 percent. Each estimate is relatively
close to the population mean. Now, imagine you
repeated this study 10,000 times and obtained 10,000 point estimates
of the mean. In other words, you take
10,000 random samples of 50 districts and compute
the mean for each sample. According to the
central limit theorem, the mean of your
sampling distribution will be roughly equal
to the population mean. You can use Python to
compute the mean of the sampling distribution
with 10,000 samples. Let’s go over the
code step by step. First, create an empty list to store the sample mean
from each sample. Name this estimate_list. Second, set up a for loop
with the range function. The loop will run 10,000 times and iterate over each
number in the sequence. Third, specify what you want to do in each
iteration of the loop. The sample function tells
the computer to take a random sample of 50
districts with replacement. The argument n equals 50 and the argument
replace equals True. The append function adds a single item to the end of an existing list. In this case, it appends the value of each sample mean to the list. Your code generates a list of 10,000 values, each of which is the sample mean from a random sample.
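A sketch of that loop, again under the same naming assumptions:

```python
estimate_list = []
for i in range(10_000):
    # Each iteration: a new random sample of 50 districts (with
    # replacement), reduced to its sample mean and stored in the list.
    estimate_list.append(
        education_districtwise['OVERALL_LI'].sample(n=50, replace=True).mean()
    )
```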
Next, create a new data frame for your list of 10,000 estimates. Name a new variable, estimate_df, to store your data frame. Now, name a new variable, mean_sample_means. Then compute the mean for your sampling distribution of 10,000 random samples. Display the value of your variable.
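In code (the column name 'estimate' is an assumption):

```python
estimate_df = pd.DataFrame(data={'estimate': estimate_list})
mean_sample_means = estimate_df['estimate'].mean()
mean_sample_means  # about 73.41 in the video's example
```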
The mean of your sampling distribution is about 73.41 percent. This is essentially identical to the population mean of your complete dataset, which is about 73.4 percent. To visualize the
relationship between your sampling
distribution of 10,000 estimates and the normal distribution, we can plot both
at the same time. For now, don’t worry about the code, as it’s beyond
the scope of this course. I want to share three
takeaways from this graph. First, as the central
limit theorem predicts, the histogram of the
sampling distribution is well approximated by the
normal distribution. The outline of the histogram closely follows
the normal curve. Second, the mean of the
sampling distribution, the blue dotted line, overlaps with the
population mean, the green solid line. This shows that
the two means are essentially equal to each other. Third, the sample mean of your first estimate
of 50 districts, the red dashed line is
farther away from the center. This is due to
sampling variability. The central limit theorem shows that as you increase
the sample size, your estimate becomes
more accurate. For a large enough sample, the sample mean closely
follows a normal distribution. Your first sample of
50 districts estimated the mean district literacy rate to be 74.22 percent, which is relatively close to the population mean
of 73.4 percent. To ensure your estimate will
be useful to the government, you can compare the
nation’s literacy rate to other benchmarks, such as the global literacy rate or the literacy rate
of peer nations. If the nation’s literacy rate
is below these benchmarks, this may help convince
the government to devote more resources to improving
literacy across the country. Estimating population
parameters through sampling is a powerful form
of statistical inference. When you’re dealing
with large numbers and complex calculations, Python helps you quickly
make accurate estimates.

Lab: Activity: Explore sampling


Lab: Exemplar: Explore sampling

Practice Quiz: Test your knowledge: Work with sampling distributions in Python

Which Python function can be used to simulate random sampling?

Which of the following statements describe a random seed when specifying random_state in pandas.DataFrame.sample()? Select all that apply.

Review: Sampling


Video: Wrap-up

Why Sampling Matters

  • Practicality: Working with entire populations is often impossible due to cost, time, or sheer data size. Sampling makes analysis feasible.
  • Large Datasets: Modern data analytics often involves massive datasets, necessitating well-chosen samples for efficient analysis.

Key Concepts

  • Sampling Process: Understand the steps involved, from defining your target population to collecting the sample data.
  • Sampling Methods:
    • Probability sampling: Ensures representativeness using random selection.
    • Non-probability sampling: Prone to bias due to non-random selection.
  • Bias: Be aware of how different sampling methods can introduce bias, which can distort your results and insights.
  • Sampling Distributions: Understand how to work with distributions of sample means and proportions to estimate population parameters.
  • Central Limit Theorem: This powerful theorem allows estimation of the population mean even with non-normally distributed datasets.
  • Python Tools: Know how to use SciPy for sampling distribution calculations and population parameter estimations.

Important Reminders

  • Always be critical about the origins of your sample data to ensure the validity of your analysis.
  • Representativeness is key! A biased sample won’t accurately reflect the population you’re trying to study.

You now have a solid
foundation in sampling, which will serve you well in your future role as
a data professional. In the data career space, you’ll be working with
sample data all the time. Throughout this
part of the course, we’ve explored how
data professionals use sample data to
make inferences, predictions, and estimates
about populations. Sampling is so useful, because it’s often
too expensive, time-consuming,
or complicated to collect data for an
entire population. Sometimes, a complete
dataset may be too large to process
even for a computer. Effective sampling is especially important in modern
data analytics, because data professionals often manage extremely large datasets. For example, you might work with economic data that has tens
of millions of data points, and need to use a
sample of 10,000. As a working data professional, it’s important to understand the sampling process used to
generate your sample data, and whether or not
your sample is representative of
your population. Plus, as you now know, different types of bias affect different
sampling methods. Early on, we reviewed
the different stages of the sampling process
from choosing a target population to
collecting data for your sample. Then we discussed the two main
types of sampling methods, probability sampling, and
non-probability sampling. We went over the benefits, and drawbacks of each method, and how random sampling can help ensure that your sample
is high-quality, and representative
of your population. We also discussed different
forms of bias in sampling, and how bias affects
non-probability sampling methods. You learned that any insights you draw from biased data may not be accurate or useful to your stakeholders. After that, you learned about sampling distributions for both sample means and proportions,
and how to estimate the corresponding
population parameters. We also covered the
central limit theorem, and how it helps you
estimate the population mean for many different
types of datasets. Finally, you learned how to use the Python SciPy stats module to work with sampling
distributions, and make a point estimate
of a population mean. Coming up, you’ll take
a graded assessment. To prepare, check out the reading that lists all
the new terms you’ve learned. Feel free to revisit videos, readings, and other resources
that cover key concepts. Congratulations on your
progress. Let’s keep it going.

Reading: Glossary terms from module 3

Terms and definitions from Course 4, Module 3

Quiz: Module 3 challenge

Which of the following statements accurately describe a representative sample? Select all that apply.

What stage of the sampling process refers to creating a list of all the items in the target population?

Which of the following statements accurately describe non-probability sampling? Select all that apply.

Which sampling method involves dividing a population into groups and randomly selecting some members from each group for the sample?

The instructor of a fitness class asks their regular students to take an online survey about the quality of the class. What sampling method does this scenario refer to?

Fill in the blank: Standard error measures the _____ of a sampling distribution.

What concept states that the sampling distribution of the mean approaches a normal distribution as the sample size increases?

A data professional is working with data about annual household income. They want to use Python to simulate taking a random sample of income values from the dataset. They write the following code: sample(n=100, replace=True, random_state=230). What is the sample size of the random sample?