Module 3: Sampling

Data professionals use smaller samples of data to draw conclusions about large datasets. You’ll learn about the different methods they use to collect and analyze sample data and how they avoid sampling bias. You’ll also learn how sampling distributions can help you make accurate estimates.

Learning Objectives

Use Python for sampling
Explain the concept of standard error
Define the central limit theorem
Explain the concept of sampling distribution
Explain the concept of sampling bias
Describe the benefits and drawbacks of non-probability sampling methods such as convenience, voluntary response, snowball, and purposive
Describe the benefits and drawbacks of probability sampling methods such as simple random, stratified, cluster, and systematic
Explain the difference between probability sampling and non-probability sampling
Describe the main stages of the sampling process
Explain the concept of a representative sample

Table Of Contents

Introduction to sampling
Sampling distributions
Work with sampling distributions in Python
Review: Sampling

Introduction to sampling

Video: Welcome to module 3

Notes

Transcript

Building on Your Knowledge

You’ve mastered descriptive statistics and basic probability – these form the foundation for more advanced techniques.
Data science is constantly evolving, so continuous learning is key!

Focus on Sampling

What is Sampling? Selecting a smaller subset (sample) from a larger group (population) to make inferences.
Why Sample? Studying entire populations can be impractical. A representative sample allows you to draw conclusions about the whole.

Topics We’ll Cover

Review: Revisiting inferential statistics and what makes a sample representative.
Sampling Process: Steps from defining your target population to collecting the sample data.
Types of Sampling: Probability (random chance) vs. non-probability methods, with their pros and cons.
Bias: Understanding how sampling methods can introduce bias (undercoverage, non-response, etc.)
Sampling Distributions: How sample statistics relate to the population.
Central Limit Theorem: A powerful tool for estimating population means.
Python Skills: Using the SciPy stats module for sampling and analysis.

Hello again. It’s great to be back to continue
our learning journey together. You’ve learned a lot so far. You now have a better understanding of the
essential role statistics plays in data science. You also have a solid foundation for
using descriptive statistics and basic probability to describe,
analyze and interpret data. Your knowledge of the fundamental concepts
of statistics is the first step on a path that leads to more advanced methods
like hypothesis testing and regression analysis. What’s really exciting is that this
learning journey will continue throughout your future career as a data professional. The amount of data in the world
is always growing and the data career space is
constantly advancing. I often read about new machine learning
methods to keep up with the changes in the field and
develop new skills to use at work. The next stage of your journey
is all about sampling or the process of selecting a subset
of data from a population. For example if you want to survey
a population of 100,000 people you can select a representative
sample of 100 people. Then you can draw conclusions about
the population based on your sample data coming up. We’ll go over how the sampling process
works and how data professionals use sample data to better understand larger
populations recall that in statistics, a population can refer to any type of
data including people objects, events, measurements and more. We’ll start with a review
of infrared statistics and examine the concept of
a representative sample. Next we’ll go over the different stages
of the sampling process from choosing a target population to collecting data for
your sample. Then we’ll explore the two main types of
sampling methods, probability sampling and non probability sampling. We’ll discuss the benefits and
drawbacks of various sampling methods and describe how random sampling can help
ensure that your sample is representative of the population, will also
introduce different forms of bias and sampling like under coverage and
non response bias and how they can affect non
probability sampling methods. After that,
we’ll explore sampling distributions, which are probability distributions for
sample statistics. You learn about sampling distributions for
both sample means and proportions and how to estimate the corresponding
values for populations. We’ll also cover the central
limit theorem and how it can help you estimate the population
mean for different types of datasets. Finally, you learn how to use
python’s Saipi stats module to work with sampling distributions and make
a point estimate of the population mean. When you’re ready,
I’ll meet you in the next video

Video: Cliff: Value everyone’s contributions

Notes

Transcript

Cliff’s Role at Google

Uses data to enhance employee well-being, productivity, and overall experience.
Contributes to HR strategy, hybrid work policies, and location planning.

Developing Confidence in Analytics

Embracing teamwork: Recognizes that collaboration and diverse skillsets are key, not needing to have all the answers individually.

Communication Strategies

Understanding Business Goals: Initial meetings focus on partners’ broader objectives, not just project specifics. This helps situate your data insights within their larger success picture.
Active Listening & Confirmation: Plays back his understanding of partners’ questions/goals to ensure alignment.
Presenting Data-Driven Options: Instead of just offering answers, suggests various possibilities based on the data. This sparks dialogue and lets partners identify what resonates most.

Key Takeaway

Cliff emphasizes a balance between listening to partners and using your data expertise to collaboratively find the best solutions.

[MUSIC] My name is Cliff and
I’m a workforce planning and people analytics lead at Google. I use data to help our employees be
more productive, more connected, and just overall improve their well-being. I also use data to improve
our HR practices and focus on our hybrid work policies
as well as our location strategy. I’ve always been interested in
issues of workforce development, people strategy, human resources. But I didn’t anticipate what
a central role analytics would play in my work and
how much I would come to love analytics. One of the things that has helped me
develop a confident voice in this space has just been understanding that we work
in teams, we work cross-functionally, and I don’t need to bring all of
the solutions to a problem, right? I’ll bring perspective on how we can
use data to solve this problem, but I’m working with people who
also bring a wealth of skills. And looking at it really
as a partnership where it’s really about leveraging the best
of everyone in the team, that’s helped me bring a lot
of confidence to the work. My go to strategy for communications when working with partners
is to first set up a few low-stakes meetings just to understand what
their broader business goals are. I’m not even thinking about the specific
project we’re working on, but more broadly, how do they define success? That helps me understand where the work
that we’re doing fits into the context of their bigger picture. The second thing I do from a communication
standpoint is to try to play back what I think I heard somebody say. Sort of to repeat it back to them, whether
it’s my understanding of their question or the output that they’re
trying to see from the data, just to test if I actually
really understand their goals. When I’m working with somebody and
I feel we’re not getting to the root of a question or a problem, what I find
is really helpful is laying out from a data perspective a set of different
options or possibilities for them. And engaging in a conversation around
which of those really resonate for them. And so, it’s finding a balance
between listening and telling as a way to
unlock some of my data, they might not have
thought about themselves.

Video: Introduction to sampling

Notes

Quiz

Transcript

Why Sampling Matters

Practicality: Analyzing an entire population is often impossible, especially with large datasets. Sampling saves time, money, and resources.
Example: Determining the percentage of laptop users in a city is much more feasible with a sample than surveying the entire population.

The Importance of Representative Samples

Accurate Inferences: A representative sample accurately reflects the characteristics of the larger population you’re interested in.
Bad Samples = Bad Insights: If your sample is biased (e.g., only includes computer scientists), your conclusions about the population will be unreliable.
Example: Using only the heights of professional basketball players wouldn’t give you an accurate picture of the average height within the full population.

Key Takeaways for Data Professionals

“Garbage in, garbage out”: A sophisticated model can’t fix the problems caused by a non-representative sample.
Focus on Sample Quality: Ensure that your sample is representative of the target population to make reliable inferences and predictions.

A representative sample does not reflect the characteristics of a population.

False

A representative sample accurately reflects the characteristics of a population. If a sample does not accurately reflect the characteristics of a population, then the inferences will likely be unreliable and predictions inaccurate. This can lead to negative outcomes for stakeholders and organizations.

Earlier in the course, we briefly
discussed the difference between descriptive and inferential statistics. Descriptive statistics like the mean and standard deviation summarize
the main features of the dataset. Inferential statistics use sample
data to draw conclusions or make predictions about
a larger population. Now we’re going to return to
inferential statistics and explore the relationship between
sample and population in more detail. This part of the course
is all about sampling, the process of drawing a subset
of data from a population. In this video, we’ll discuss how data professionals
use sampling in data science and the importance of working with the sample
that is representative of the population. Data professionals use sampling to
analyze many different types of data. Here are some questions that sampling
has helped my data science team answer, how many products in an app store do
we need to test to feel confident that all the products
are secure from malware? How do we select a sample of users
to run an effective A/B test for an online retail store? And how do we select a sample of customers
of a video streaming service to get reliable feedback on the shows they watch? Sampling is useful in data science because
selecting a sample requires less time than collecting data on
every item in a population. Using a sample saves money and
resources and analyzing a sample is more practical
than analyzing an entire population. This is especially important
in modern data analytics where you often deal with
extremely large datasets. For example, let’s say you want to know
the percentage of people in a large city who uses a laptop computer. One way to do this is to survey
every resident in the city. First, it would be very difficult
to access contact information for every resident of the city. Second, giving a survey to every resident
of the city would be very expensive, complicated and time consuming. Another way is to find a much
smaller subset of residents and give them a survey. This subset is your sample, then you can
use the sample data you collect about laptop use to draw conclusions about
the laptop use of the entire population. Collecting a sample is faster,
more practical and less expensive than collecting data
on every member of the population. Keep in mind that your sample should
be representative of your population. Recall that a representative sample
accurately reflects the characteristics of a population. The inferences and predictions you make about your
population are based on your sample data. If your sample doesn’t accurately
reflect your population, then your inferences will not be reliable
and your predictions will not be accurate. And this can lead to negative outcomes for
stakeholders and organizations. For instance, let’s say you only
contact computer scientists for your laptop survey. Your sample will not accurately
reflect the overall population. Computer scientists are much more likely
to use a laptop computer than the typical city resident. Many residents may not have
access to any kind of computer or even know how to use one. A sample that only includes computer
scientists is not representative. A representative sample would include
people with different levels of computer knowledge and access. Let’s consider another scenario. Imagine you want to find out
the average height of every adult in the United States, that’s a lot of people. It would take an incredible
amount of time, energy and money to even attempt to measure
every person in the country. Instead you can take
a sample of 100 people and use that sample data to draw conclusions
about the entire population. Now, let’s say you have sample data only
from professional basketball players. Pro basketball players are really tall,
some are over seven feet tall. On average, they’re much taller than
almost everybody else in the population. Their average height does not accurately
reflect the average height of the overall population. A sample that includes only pro basketball
players is not representative of every adult in the US. As a data professional,
I work with sample data every day. I can tell you that having a
representative sample is super important. A wise teammate of mine once said that a
good model can’t overcome a bad sample and the right. Data professionals work with powerful
statistical tools that can model complex datasets and
help generate valuable insights. But if the sample data you’re working
with does not accurately reflect your population that is if your
sample is not representative, then it ultimately doesn’t
matter how good your model is. If your predictive model
is based on a bad sample, then your predictions
will not be accurate. Ultimately, the quality of
your sample helps determine the quality of the insights
you share with stakeholders. To make reliable inferences about all your
customers based on feedback from a sample of customers, make sure your sample
is representative of the population.

Reading: The relationship between sample and population

Reading

Earlier, you learned that inferential statistics use sample data to draw conclusions or make predictions about a larger population. Data professionals use inferential statistics to gain valuable insights about their data.

In this reading, you’ll learn about the relationship between sample and population in more detail. We’ll also discuss how data professionals use sampling in data work, and the importance of working with a sample that is representative of the population.

Population and Sample

Population vs. sample

In statistics, a population includes every possible element that you are interested in measuring, or the entire dataset that you want to draw conclusions about. A statistical population can refer to any type of data, including:

People
Organizations
Objects
Events
And more

For instance, a population might be the set of:

All students at a university
All the cell phones ever manufactured by a company
All the forests on Earth

A sample is a subset of a population.

Samples drawn from the above populations might be:

The math majors at the university
The cell phones manufactured by the company in the last week
The forests in Canada

Data professionals use samples to make inferences about populations. In other words, they use the data they collect from a small part of the population to draw conclusions about the population as a whole.

Sampling

Sampling is the process of selecting a subset of data from a population.

In practice, it’s often difficult to collect data on every member or element of an entire population. A population may be very large, geographically spread out, or otherwise difficult to access. Instead, you can use sample data to draw conclusions, make estimates, or test hypotheses about the population as a whole.

Data professionals use sampling because:

It’s often impossible or impractical to collect data on the whole population due to size, complexity, or lack of accessibility
It’s easier, faster, and more efficient to collect data from a sample
Using a sample saves money and resources
Storing, organizing, and analyzing smaller datasets is usually easier, faster, and more reliable than dealing with extremely large datasets

Example: election poll

Imagine you’re a data professional working in a country with a large population like India, Indonesia, the United States, or Brazil. There is an upcoming national election for president. You want to conduct an election poll to see which candidate voters prefer. Let’s say the population of eligible voters is 100 million people. To survey 100 million people on their voting preferences would take an enormous amount of time, money, and resources – even assuming it would be possible to locate and contact all voters, and that all voters would be willing to participate.

However, it is realistic to survey a sample of 100 or 1000 voters drawn from the larger population of all voters. When you’re dealing with a large population, sampling can help you make valid inferences about the population as a whole.

Representative sample

To make valid inferences or accurate predictions about a population, your sample should be representative of the population as a whole. Recall that a representative sample accurately reflects the characteristics of a population. The inferences and predictions you make about your population are based on your sample data. If your sample doesn’t accurately reflect your population, then your inferences will not be reliable, and your predictions will not be accurate. And this can lead to negative outcomes for stakeholders and organizations.

Statistical methods such as probability sampling help ensure your sample is representative by collecting random samples from the various groups within a population. These methods help reduce sampling bias and increase the validity of your results. You’ll learn more about sampling methods later on.

Example: election poll

Ideally, the sample for your election poll will accurately reflect the characteristics of the overall voter population. A voter population in a large country will be diverse in political perspectives, geographic location, age, gender, race, education level, socioeconomic status, etc. Your sample will not be representative if you only collect data from people who belong to certain groups and not others. For example, if you survey people from one political party, or who have advanced degrees, or are older than 70. The results of an election poll based on a non-representative sample will not be accurate. In general, any claims or inferences you make about any population will have more validity if they are based on a representative sample.

Key takeaways

Data professionals work with powerful statistical tools that can model complex datasets and help generate valuable insights. But, if the sample data you’re working with doesn’t accurately reflect your population—that is, if your sample is not representative—then it doesn’t matter how good your model is. If your predictive model is based on a bad sample, then your predictions will not be accurate.

Ultimately, the quality of your sample helps determine the quality of the insights you share with stakeholders. To make reliable inferences about a population, make sure your sample is representative of the population.

Video: The sampling process

Notes

Quiz

Transcript

Why Sampling Matters

Data professionals constantly work with samples of larger populations.
Understanding how sampling works is crucial to assess sample quality, biases, and how well it represents the population.
Example: A biased sample (like only polling basketball players on average height) leads to bad conclusions.

5 Steps of the Sampling Process

Identify Target Population: Define the exact group you want to study (e.g., legal adult residents of Vancouver).
Select Sampling Frame: Create a practical list of the population you can access (e.g., a voter registry, even if slightly imperfect).
Choose Sampling Method:
- Probability sampling (random selection) is ideal for representativeness.
- Non-probability (convenience, researcher bias) is less reliable.
Determine Sample Size: Larger samples generally increase accuracy, but there are tradeoffs with cost and time.
Collect Sample Data: Carry out your survey, experiment, or data gathering on your selected sample.

Example: Subway Opinion Poll

The video uses the example of polling support for a new subway system to illustrate how each step affects the final results.

Key Takeaways

Careful sampling makes your sample data reflect the population you’re studying.
This leads to more reliable conclusions and better decision-making.

When working with sample data, what is the first step in the sampling process?

Identify the target population

The first step in the sampling process is to identify the target population. The sampling process helps determine whether a sample is representative of the population and if it is unbiased.

As a data professional,
you’ll work with sample data all the time. Often this will be sample data previously
collected by other researchers. Sometimes your team may
collect their own data. Either way, it’s important to know
how the sampling process works because it directly affects
the quality of your sample data. The sampling process helps determine
whether your sample is representative of the population and
whether your sample is unbiased. If you estimate the mean height of a
country’s total adult population based on a sample of professional
basketball players, your estimate will not be
accurate. In this video, we’ll go over the main stages
of the typical sampling process. This will give you a useful framework for
understanding how sampling is conducted and how the sampling process
can affect your sample data. To get a clear overview of the sampling
process, let’s divide it into five steps. Step one, identify the target population. Step two, select the sampling frame. Step three,
choose the sampling method. Step four, determine the sample size, and step five,
collect the sample data. As an example, let’s consider a public opinion poll. Imagine the city government of Vancouver,
Canada wants to build a new subway system. The public will vote on whether or
not to move forward with the project. The city government wants to find out if
there’s public support for the project. They ask you to take a poll and estimate the percentage of adult
residents that support the project. Legal adults are 18 years or older. The first stage in the sampling process
is to identify your target population. The target population is the complete set
of elements that you’re interested in knowing more about. In this case, the target population
includes every resident in the city who is 18 years or older and eligible to vote. Let’s say that the city contains 100,000
such residents. Since it’s too difficult and too expensive to survey
everyone in the target population, you decide to take a sample. The next step in the sampling process
is to create a sampling frame. A sampling frame is a list of all
the items in your target population. Basically, it’s a complete list of
everyone or everything you’re interested in researching. The difference
between a target population and a sampling frame is that the population
is general and the frame is specific. So if your target population is 100,000
city residents who are 18 years or older and eligible to vote, your sampling
frame could be a list of names for all these residents–from
Alana Aoki to Zoe Zappa. For practical reasons, your sampling frame may
not accurately match your target population because you may not have
access to every member of the population. For instance, the city may not have
reliable contact information about each resident, or perhaps not all eligible
voters are actually registered to vote, so their opinions about the potential
subway system aren’t relevant, since the project will be decided by
an election. For reasons like these, your sampling frame will not exactly
overlap with your target population. Your sampling frame will include
the list of residents 18 or over that you’re able to obtain
useful information about. So the sampling frame is the accessible
part of your target population. Next, you need to choose a sampling method,
which is step three of the sampling process. One way to help ensure that your sample
is representative is to choose the right sampling method. There are two main types of sampling
methods: probability sampling and non-probability sampling. In later videos, we’ll explore the specifics
in more detail. For now, just know that probability sampling uses
random selection to generate a sample. Non-probability sampling is
often based on convenience or the personal preferences of the researcher
rather than random selection. Because probability sampling methods are
based on random selection, every person in the population has an equal chance
of being included in the sample. This gives you the best chance to get
a representative sample, as your results are more likely to accurately
reflect the overall population. So, assuming you have the budget and the
time, you can use a probability sampling method for
your poll about the subway project. Using random selection gives you the best
chance of getting a sample that’s representative of your population. Step four of the sampling process is to
determine the best size for your sample, since you don’t have the resources to
pull everyone in your sampling frame. In statistics, sample size refers to the
number of individuals or items chosen for a study or experiment. Sample size helps determine
the accuracy of the predictions you make about the population. In general, the larger the sample size, the more accurate your predictions. Based
on the desired level of accuracy for your survey, you can decide how many
eligible voters to include in your sample. Now, you’re ready to
collect your sample data. This is the final step
in the sampling process. To pull the residents selected for
your sample, you decide to conduct a survey.
Based on the survey responses, you determine the percentage
of eligible voters 18 and over who favor the proposed
subject project. Then, you share this information with city
leaders to help them make a more informed decision. Effective sampling ensures that
your sample data is representative of your target population. Then, when you use sample data to make
inferences about the population, you can be reasonably confident
that your inferences are reliable. Your poll will give city leaders
a better idea of public support for the new subway and help inform
future decisions about the project. The decisions you make at each step of the
sampling process can affect the quality of your sample data. Understanding the sampling process will
make you a better data professional, whether you’re analyzing data
collected by other researchers or conducting a survey on your own.

Reading: The stages of the sampling process

Reading

Recently, you’ve been learning about sampling. As a data professional, you’ll work with sample data all the time. Often, this will be sample data previously collected by other researchers; sometimes, your team may collect their own data. Either way, it’s important to know how the sampling process works, because it helps determine whether your sample is representative of the population, and whether your sample is unbiased.

In this reading, we’ll go over the main stages of the sampling process in more detail. This will give you a better understanding of how the sampling process works and how each step of the process can affect your sample data.

The sampling process

First, let’s review the main steps of the sampling process:

Identify the target population
Select the sampling frame
Choose the sampling method
Determine the sample size
Collect the sample data

Let’s explore each step in more detail with an example. Imagine you’re a data professional working for a company that manufactures home appliances. The company wants to find out how customers feel about the innovative digital features on their newest refrigerator model. The refrigerator has been on the market for two years and 10,000 people have purchased it. Your manager asks you to conduct a customer satisfaction survey and share the results with stakeholders.

Step 1: Identify the target population

The first step in the sampling process is defining your target population. The target population is the complete set of elements that you’re interested in knowing more about. Depending on the context of your research, your population may include individuals, organizations, objects, events, or any other type of data you want to investigate.

A well-defined population reduces the probability of including participants who do not fit the precise scope of your research. For example, you don’t want to include all the company’s customers, or customers who purchased the company’s other refrigerator models.

In this case, your target population will be the 10,000 customers who purchased the company’s newest refrigerator model. These are the customers you want to survey to learn about their experience with the newest model.

Step 2: Select the sampling frame

The next step in the sampling process is to create a sampling frame. A sampling frame is a list of all the individuals or items in your target population.

The difference between a target population and a sampling frame is that the population is general and the frame is specific. So, if your target population is all the customers who purchased the refrigerator, your sampling frame could be an alphabetical list of the names of all these customers. The customers in your sample will be selected from this list.

Ideally, your sampling frame should include the entire target population. However, for practical reasons, your sampling frame may not exactly match your target population, because you may not have access to every member of the population. For instance, the company’s customer database may be incomplete, or contain data processing errors. Or, some customers may have changed their contact information since their purchase, and you may be unable to locate or contact them. Furthermore, sometimes the sampling frame might include elements outside of the target population simply by accident or because it is impossible to know the target population with certainty.

Therefore, generally your sampling frame is the accessible part of your target population, but sometimes it will include elements apart from this set.

Step 3: Choose the sampling method

The third step in the sampling process is choosing a sampling method.

There are two main types of sampling methods: probability sampling and non-probability sampling. Later on, we’ll explore the specific methods in more detail. For now, just know that probability sampling uses random selection to generate a sample. Non-probability sampling is often based on convenience, or the personal preferences of the researcher, rather than random selection. Often, probability sampling methods require more time and resources than non-probability sampling methods.

Ideally, your sample will be representative of the population. One way to help ensure that your sample is representative is to choose the right sampling method. Because probability sampling methods are based on random selection, every element in the population has an equal chance of being included in the sample. This gives you the best chance to get a representative sample, as your results are more likely to accurately reflect the overall population.

So, assuming you have the budget, the resources, and the time, you should use a probability sampling method for your survey.

Step 4: Determine the sample size

Step four of the sampling process is to determine the best size for your sample, since you don’t have the resources to survey everyone in your sampling frame. In statistics, sample size refers to the number of individuals or items chosen for a study or experiment.

Sample size helps determine the precision of the predictions you make about the population. In general, the larger the sample size, the more precise your predictions. However, using larger samples typically requires more resources.

The sample size you choose depends on various factors, including the sampling method, the size and complexity of the target population, the limits of your resources, your timeline, and the goal of your research.

Based on these factors, you can decide how many customers to include in your sample.

Step 5: Collect the sample data

Now, you’re ready to collect your sample data, which is the final step in the sampling process.

You give a customer satisfaction survey to the customers selected for your sample. The survey responses provide useful data on how customers feel about the digital features of the refrigerator. Then, you share your results with stakeholders to help them make more informed decisions about whether to continue to invest in these features for future versions of this refrigerator, and develop similar features for other models.

Key takeaways

Effective sampling ensures that your sample data is representative of your population. Then, when you use sample data to make inferences about the population, you can be reasonably confident that your inferences are reliable.

The decisions you make at each step of the sampling process can affect the quality of your sample data. Understanding the sampling process will make you a better data professional, whether you’re analyzing data collected by other researchers or conducting a survey on your own.

Video: Compare sampling methods

Notes

Quiz

Transcript

Types of Probability Sampling

Simple Random Sampling:
- Every member of the population has an equal chance of selection.
- Pro: Unbiased, representative results.
- Con: Can be expensive/time-consuming for large populations.
Stratified Random Sampling:
- Population is divided into strata (groups) and members are randomly selected from each group.
- Pro: Ensures representation from all relevant groups.
- Con: Requires knowledge of the population to pick effective strata.
Cluster Random Sampling:
- Population is divided into clusters, and entire clusters are randomly chosen for the sample.
- Pro: Useful for large, diverse populations with clear subgroups.
- Con: Clusters may not perfectly reflect the overall population.
Systematic Random Sampling:
- Population is ordered, and members are selected at regular intervals from a random starting point.
- Pro: Quick and convenient with a full member list.
- Con: Requires knowing the total population size beforehand.

Key Point: All probability sampling methods rely on random selection to reduce bias and get more accurate results compared to non-probability methods.

Fill in the blank: Probability sampling uses ____ selection to generate a sample.

Random

Probability sampling uses random selection to generate a sample. There are four methods: simple, stratified, cluster, and systematic. All are based on random selection, which is the preferred method of sampling for most data professionals

In a previous video, you learned the
differences between probability and
non-probability sampling. Then you conducted a survey
using probability sampling, which is the third step
of the sampling process. In this video, you
will learn more about the different methods
of probability sampling. Then we’ll discuss the benefits and drawbacks of each method. There are four different
probability sampling methods: simple random sampling, stratified random sampling,
cluster random sampling, and systematic random sampling. In a simple random sample, every member of a
population is selected randomly and has an equal
chance of being chosen. You can randomly
select members using a random number generator or by another method of
random selection. For example, say
you want to survey the employees of a company
about their work experience. The company employs
1,000 people. You can assign each employee in the company database a
number from 1-1,000, and then use a random
number generator to select 100 people
for your sample. The main benefit of simple random samples is
that they’re usually fairly representative since
every member of the population has an equal
chance of being chosen. Random samples
tend to avoid bias and surveys like these give
you more accurate results. However, in practice, it’s often expensive
and time-consuming to conduct large, simple
random samples. If your sample size
is not large enough, a specific group of people in the population may be
underrepresented in your sample. If you use a larger sample size, your sample will more accurately
reflect the population. The next method of
probability sampling is a stratified random sample. In a stratified random sample, you divide a population
into groups and randomly select some members from each group to
be in the sample. These groups are called strata. Strata can be organized
by age, gender, income, or whatever category you’re interested in studying. For example, say you want to
survey high school students about how much time they
spend studying on weekends. You might divide the
student population according to age: 14, 15, 16, and 17-year-olds. Then you can survey
an equal number of students from each age group. Stratified random samples
help ensure that members from each group
in the population are included in the survey. This method allows you to draw more accurate conclusions
about the relevant groups. For instance, 14-year-olds
and 17-year-olds, may have different perspectives about studying on the weekends. Older students can
drive and may have more social activities
or work on the weekends. Stratified sampling will
capture both perspectives. One main disadvantage of stratified sampling is that
it can be difficult to identify appropriate strata for a study if you lack
knowledge of a population. For example, if
you want to study median income among
a population, you may want to stratify
your sample by job type, or industry, or location,
or education level. If you don’t know how relevant these categories are
to median income, it will be difficult to choose the best one for your study. The next method of
probability sampling is a cluster random sample. When you’re conducting a
cluster random sample, you divide a population
into clusters, randomly select
certain clusters, and include all members from the chosen clusters
in the sample. Cluster sampling is similar to stratified random sampling, but in stratified sampling, you randomly choose some members from each group to
be in the sample. In cluster sampling, you choose all members from a group
to be in the sample. Clusters are divided using identifying details,
such as age, gender, location, or
whatever you want to study. For example, imagine you
want to conduct a survey of employees at a global company
using cluster sampling. The company has 10 offices in different cities
around the world. Each office has about
the same number of employees and similar job roles. You randomly select
three offices in three different
cities as clusters. You include all the employees at the three offices
in your sample. One advantage of this method
is that a cluster sample gets every member from
a particular cluster, which is useful
when each cluster reflects the
population as a whole. This method is helpful
when dealing with large and diverse
populations that have clearly defined subgroups. If researchers want
to learn about the preferences of primary
school students in Oslo, Norway, they can
use one school as a representative sample of
all schools in the city. A main disadvantage of cluster sampling is that
it may be difficult to create clusters
that accurately reflect the overall population. For example, for
practical reasons, you may only have access
to the offices in the United States
when your company has locations all
over the world. Employees in the
United States may have different characteristics
and values than employees in
other countries. The final method of
probability sampling is a systematic random sample. In systematic random samples, you put every member of a population into an
ordered sequence. Then you choose a
random starting point in the sequence and select members for your
sample at regular intervals. Let’s assume you want to survey students at a community college. For a systematic random sample, you’d put the students’
names in alphabetical order, randomly choose a
starting point, and pick every fifth name
to be in the sample. Systematic random samples are often representative
of the population since every member has an equal chance of being
included in the sample. Whether the student’s
last name starts with B or R isn’t going to affect
their characteristics. Systematic sampling
is also quick and convenient when you have a complete list of the
members of your population. One disadvantage of systematic sampling
is that you need to know the size of the population that you want to study
before you begin. If you don’t have
this information, it’s difficult to choose
consistent intervals. The four methods of probability
sampling we’ve covered, simple, stratified, cluster, and systematic, are all
based on random selection, which is the preferred method of sampling for most
data professionals. These methods can
help you create a sample that is representative
of the population. In an upcoming video, we’ll check out some methods of non-probability sampling and why they’re not considered
representative.

Reading: Probability sampling methods

Reading

Probability-sampling-methods Download

Video: The impact of bias in sampling

Notes

Quiz

Transcript

The Importance of Representative Samples

Machine learning models trained on biased data are likely to make unfair and inaccurate decisions.
Representative samples, where everyone in the population has an equal chance of being included, help reduce bias and create fairer models.

Non-Probability Sampling

While cheaper and more convenient than probability sampling, non-probability methods are prone to bias.
These methods are useful for early exploration but shouldn’t be relied on for making conclusions about the entire population.

Four Main Types of Non-Probability Sampling

Convenience Sampling: Choosing participants who are easy to reach.
- Prone to undercoverage bias (certain groups are underrepresented)
- Example: Polling people at a school, missing those who don’t attend.
Voluntary Response Sampling: People volunteer to participate.
- Prone to nonresponse bias (those with strong opinions are more likely to respond).
- Example: A restaurant’s online survey will likely get extreme positive or negative views.
Snowball Sampling: Participants recruit others to join.
- Participants tend to be too similar, making the sample unrepresentative.
- Example: Study on cheating, where the initial few students recruit friends who might also have cheated.
Purposive Sampling: Picking participants based on specific criteria.
- Intentionally excludes groups, making the sample focused but potentially biased.
- Example: Surveying only high-GPA students about teaching methods misses insights from those who struggle.

Key Takeaway for Data Professionals

Be constantly aware of bias, from data collection to presenting results.
Understanding common types of bias helps you spot them in your work and take steps to minimize them.

Sampling bias occurs when a sample is not representative of the population as a whole.

True

Sampling bias occurs when a sample is not representative of the population as a whole. Models based on representative samples are much more likely to lead to fair and unbiased decisions.

In my work as a data
professional I often use sample data to help build
machine learning models. Today, a machine-learning
model may help determine if a person
gets an approval for a loan, an interview for a job or an
accurate medical diagnosis. Models based on
representative samples are much more likely to make fair and unbiased decisions about who gets a loan
or a job interview. Using samples that
are representative of the different
types of people in the population helps ensure that each person receives the
treatment that is best for them. Unfortunately, bias can
affect sample data. Sampling bias occurs
when a sample is not representative of the
population as a whole. To eliminate bias, I try
to use samples that are representative of the
overall population. The consequences of
drawing conclusions from a non-representative
sample can be serious. Recently you learned that
probability sampling methods use random selection, which helps avoid sampling bias. A randomly chosen sample
means that all members of the population have an equal
chance of being included. In contrast, non-probability
sampling methods do not use random selection, so they do not typically
generate representative samples. In fact, they often
result in biased samples. However, non-probability
sampling is often less expensive and more convenient
for researchers to conduct. Sometimes due to budget, time or other reasons, it’s just not possible to
use probability sampling. Plus non-probability
methods can be useful for exploratory studies
which seek to develop an initial
understanding of a population, not draw conclusions or make predictions about the
population as a whole. In this video, we’ll
discuss four methods of non-probability
sampling and learn how sampling bias can
affect each method. These four methods are
convenient sampling, voluntary response sampling, snowball sampling, and
purposive sampling. Let’s start with
convenience sampling. In this method, you
choose members of a population that are
easy to contact or reach. As the name suggests, conducting a convenience
sample involves collecting a sample from
somewhere convenient to you, such as your workplace, a local school,
or a public park. For example, to conduct
an opinion poll, a researcher might stand in
front of a local high school during the day and poll people
that happened to walk by. Because these symbols are
based on convenience to the researcher and not a broader sample of
the population, convenience samples often
show undercoverage bias. Undercoverage bias occurs
when some members of a population are inadequately
represented in the sample. For instance, people
who don’t work at or attend the school will not be represented as much
in this sample. The next method of
non-probability sampling is voluntary response sampling. This type of sample
consists of members of a population who volunteer
to participate in a study. For example, the
owners of a restaurant want to know how people feel
about their dinner options. They ask their regular
customers to take an online survey about the quality of the
restaurant’s food. Voluntary response
samples tend to suffer from nonresponse bias, which occurs when certain
groups of people are less likely to
provide responses. People who voluntarily respond will likely have
stronger opinions, either positive or negative than the rest of the population. This makes the
volunteer customers at the restaurant an
unrepresentative sample. The next non-probability
sampling method is snowball sampling. In a snowball sample researchers recruit initial
participants to be in a study and then
ask them to recruit other people to
participate in the study. Like a snowball, the
sample size gets bigger and bigger as more
participants join in. For example, if a study was investigating cheating
among college students, potential participants might
not want to come forward. But if a researcher can find a couple of students
willing to participate, these two students
may know others who have also cheated on exams. The initial participants could then recruit others by sharing the benefits of the study and reassuring them of
confidentiality. Although it may seem
convenient that study participants
help build the sample, this type of recruiting
can lead to sampling bias. Because initial
participants recruit additional participants
on their own, it’s likely that most of them will share similar
characteristics. These characteristics might be unrepresentative of the total
population under study. In purposive
sampling researchers select participants based on
the purpose of their study. Because participants
are selected for the sample according to
the needs of the study, applicants who do not fit
the profiles are rejected. For example, a researcher
wants to survey students on the effectiveness of certain teaching methods
at the university. The researcher only wants to include students who regularly attend class and have an established record of
academic achievement. So they select the students with the highest grade point averages to participate in the study. Purposive sampling,
the researcher often intentionally
exclude certain groups from the sample to focus
on a specific group they think is most
relevant to their study. In this case, the
researcher excludes students who don’t have
high grade point averages. This could lead to
biased outcomes because the students
in the sample are not likely to be representative of the overall
student population. As a data professional, you have to think about bias and fairness from the
moment you start collecting sample data to the time you present
your conclusions. Once you become aware of
some common forms of bias, you can remain on the alert
for bias in any form.

Reading: Non-probability sampling methods

Reading

Recently, you learned that probability sampling methods use random selection, which helps avoid sampling bias. A randomly chosen sample means that all members of the population have an equal chance of being included. In contrast, non-probability sampling methods do not use random selection, so they do not typically generate representative samples. In fact, non-probability methods often result in biased samples. Sampling bias occurs when some members of the population are more likely to be selected than other members.

In this reading, you’ll learn more about four methods of non-probability sampling, and learn how sampling bias can affect each method. We’ll also discuss why non-probability sampling may be useful in certain situations.

Non-probability sampling methods

Non-probability samples use non-random methods of selection, so not all members of a population have an equal chance of being selected. This is why non-probability methods have a high risk of sampling bias. However, non-probability methods are often less expensive and more convenient for researchers to conduct. Sometimes, due to limited time, money, or other resources, it’s not possible to use probability sampling. Plus, non-probability methods can be useful for exploratory studies, which seek to develop an initial understanding of a population, rather than make inferences about the population as a whole.

We’ll go over four methods of non-probability sampling:

Convenience sampling
Voluntary response sampling
Snowball sampling
Purposive sampling

Let’s explore each method in more detail.

Convenience sampling

For convenience sampling, you choose members of a population that are easy to contact or reach. As the name suggests, conducting a convenience sample involves collecting a sample from somewhere convenient to you, such as your workplace, a local school, or a public park.

For example, to conduct an opinion poll, a researcher might stand at the entrance of a shopping mall during the day and poll people that happen to walk by.

Because these samples are based on convenience to the researcher, and not a broader sample of the population, convenience samples often suffer from undercoverage bias. Undercoverage bias occurs when some members of a population are inadequately represented in the sample. For example, the above sample will underrepresent people who don’t like to shop at malls, or prefer to shop at a different mall, or don’t visit the mall because they lack transportation.

Convenience sampling is often quick and inexpensive, but it’s not a reliable way to get a representative sample.

Voluntary response sampling

A voluntary response sample consists of members of a population who volunteer to participate in a study. Like a convenience sample, a voluntary response sample is often based on convenient access to a population. However, instead of the researcher selecting participants, participants volunteer on their own.

For example, let’s say college administrators want to know how students feel about the food served on campus. They email students a link to an online survey about the quality of the food, and ask students to fill out the survey if they have time.

Voluntary response samples tend to suffer from nonresponse bias, which occurs when certain groups of people are less likely to provide responses. People who voluntarily respond will likely have stronger opinions, either positive or negative, than the rest of the population. In this case, only students who really like or really dislike the food may be motivated to fill out the survey. The survey may omit many students who have more mild opinions about the food, or are neutral. This makes the volunteer students unrepresentative of the overall student population.

Snowball sampling

In a snowball sample,researchers recruit initial participants to be in a study and then ask them to recruit other people to participate in the study. Like a snowball, the sample size gets bigger and bigger as more participants join in. Researchers often use snowball sampling when the population they want to study is difficult to access.

For example, imagine a researcher is studying people with a rare medical condition. Due to reasons of confidentiality, it may be difficult for the researcher to obtain contact information for members of this population from hospitals or other official sources. However, if the researcher can find a couple of people willing to participate, these two people may know others with the same condition. The initial participants could then recruit others by sharing the potential benefits of the study.

Snowball sampling can take a lot of time, and researchers must rely on participants to successfully continue the recruiting process and build up the “snowball.” This type of recruiting can also lead to sampling bias. Because initial participants recruit additional participants on their own, it’s likely that most of them will share similar characteristics, and these characteristics might be unrepresentative of the total population under study.

Purposive sampling

In purposive sampling, researchers select participants based on the purpose of their study. Because participants are selected for the sample according to the needs of the study, applicants who do not fit the profile are rejected.

For example, imagine a game development company wants to conduct market research on a new video game before its public release. The research team only wants to include gaming experts in their sample. So, they survey a group of professional gamers to provide feedback on potential improvements.

In purposive sampling, researchers often intentionally exclude certain groups from the sample to focus on a specific group they think is most relevant to their study. In this case, the researcher excludes amateur gamers. Amateur gamers may purchase the new game for different reasons than professional gamers, and enjoy game features that don’t appeal to professionals. This could lead to biased results, because the professionals in the sample are not likely to be representative of the overall gamer population.

Purposive sampling is often used when a researcher wants to gain detailed knowledge about a specific part of a population, or where the population is very small and its members all have similar characteristics. Purposive sampling is not effective for making inferences about a large and diverse population.

Key takeaways

Non-probability sampling is useful for collecting data in situations where you have limited time, budget, and other resources. Non-probability sampling is also useful for exploratory research, when you want to get an initial understanding of a population, rather than make inferences about the population as a whole. However, it’s important to remember that non-probability sampling methods have a high risk of sampling bias.

As a data professional, you have to think about bias and fairness from the moment you start collecting sample data to the time you present your conclusions. Once you become aware of some common forms of bias, you can remain on the alert for bias in any form.

Practice Quiz: Test your knowledge: Introduction to sampling

A data professional is conducting an election poll. As a first step in the sampling process, they identify the target population. What is the second step in the sampling process?

Select the sampling frame

The second step in the sampling process is to select the sampling frame.

Gemini

The correct answer is Select the sampling frame. Here’s why:

Steps in the Sampling Process

Identify the target population: Clearly define the group you aim to understand (e.g., likely voters in an election).
Select the sampling frame: Construct a list or accessible section of the target population from which you can actually draw your sample (e.g., a voter registration list).
Choose the sampling method: Determine how you’ll select individuals from the sampling frame (e.g., simple random sampling, stratified sampling).
Determine the sample size: Decide on the number of individuals needed for a representative sample.
Collect the sample data: Carry out the process of gathering information from the selected individuals.

Why the other options are incorrect:

Determine the sample size: This is important, but it usually comes after the sampling frame is established.
Collect the sample data: Data collection is the final stage of the process.
Choose the sampling method: While crucial for ensuring representativeness, the sampling method must be based on an established sampling frame.

ChatGPT

The second step in the sampling process is to select the sampling frame.

Fill in the blank: In a _____ sample, every member of a population is selected randomly and has an equal chance of being chosen.

simple random

In a simple random sample, every member of a population is selected randomly and has an equal chance of being chosen.

Gemini

The correct answer is simple random.

Here’s a quick definition of the terms:

Simple Random Sample: Every member of the population has an equal chance of being selected, and each possible sample of a given size is equally likely to be chosen.
Cluster Sample: The population is divided into clusters, and a random selection of clusters is chosen. All individuals within the chosen clusters are sampled.
Voluntary Response Sample: Individuals choose to participate (biased, not representative).
Snowball Sample: Initial respondents are used to recruit additional participants through their networks.

ChatGPT

In a simple random sample, every member of a population is selected randomly and has an equal chance of being chosen.

Non-probability sampling includes which of the following sampling methods? Select all that apply.

Purposive sampling

Non-probability sampling methods include convenience and purposive sampling. In convenience sampling, researchers choose members of a population that are easy to contact or reach. In purposive sampling, researchers select participants based on the purpose of their study.

Convenience sampling

Gemini

Here’s the breakdown of why the correct answers are:

Purposive sampling: Researchers deliberately choose participants based on specific characteristics that are relevant to the study. This is a non-probability method because not everyone in the population has an equal chance of being selected.
Convenience sampling: Participants are selected based on their ease of access or availability. Again, this is a non-probability method because it doesn’t give everyone in the population an equal chance.

Why the others are incorrect:

Systematic random sampling: This is a probability sampling method where a starting point is randomly chosen, and then individuals are selected at fixed intervals
Stratified random sampling: This is a probability sampling method where the population is divided into subgroups (strata), and then random samples are taken from each subgroup.

Key takeaway: The main difference between probability sampling and non-probability sampling is that in probability methods, everyone in the population has a known and equal chance of being selected.

ChatGPT

Non-probability sampling includes the following sampling methods:

Purposive sampling
Convenience sampling

Sampling distributions

Video: How sampling affects your data

Notes

Quiz

Transcript

What is a Sampling Distribution?

A sampling distribution is a probability distribution that represents the possible values of a sample statistic (like the sample mean).
It’s created by taking repeated random samples of the same size from a population.
The sampling distribution of the mean shows how the means of different samples from the same population would be distributed.

Key Points about the Sampling Distribution of the Mean

Central Limit Theorem: As the sample size increases, the sampling distribution of the mean approaches a normal distribution, even if the population distribution is not normal.
Mean: The mean of the sampling distribution of the mean is equal to the population mean.
Variability: The variability in the sampling distribution is measured by the standard error of the mean. The standard error gets smaller as the sample size increases.
Estimation: The sample mean can be used as a point estimate to approximate the population mean.

Standard Error

The standard error of the mean represents the average amount that sample means deviate from the true population mean.
A smaller standard error means a sample mean is more likely to be a good estimate of the population mean.
Formula: Standard Error = Sample Standard Deviation / Square Root of Sample Size (S / √n)

Key Takeaway

Understanding sampling distributions and the standard error are critical concepts in statistics. They help data professionals assess the accuracy and reliability of estimates derived from sample data.

What term describes a probability distribution of a sample statistic?

Sampling distribution

Sampling distribution describes a probability distribution of a sample statistic. Probability distribution represents the possible outcomes of a random variable; sampling distribution represents the possible outcomes for a sample statistic.

In previous videos, you learned
how the sampling process works and the benefits and drawbacks of various
sampling methods. As a data professional, I often work with sample data to make informed predictions about future sales revenue
or product performance. Understanding how
sampling effects your data both positively
and negatively, will be important in
your future career in data analytics. For instance, one-way
data professionals use sample statistics is to
estimate population parameters. As you may recall, a statistic
is a characteristic of a sample and a parameter is a characteristic
of a population. For example, the mean weight of a random sample of 100
penguins is a statistic. The mean weight of the
total population of 10,000 penguins is a parameter. A data professional might use the mean weight of the sample of 100 penguins to estimate the mean weight of
the population. This type of estimate is
called a point estimate. A point estimate uses a single value to estimate
a population parameter. In this video, we’ll discuss
the concept of sampling distribution and
how it can help you represent the possible
outcomes of a random sample. You will also learn how the
sampling distribution of the sample mean
can help you make a point estimate of
the population mean. A sampling distribution is a probability distribution
of a sample statistic. Recall that a
probability distribution represents the possible
outcomes of a random variable, such as a coin toss or die roll. In the same way, a sampling
distribution represents the possible outcomes for a sample statistic,
like the mean. Imagine you take repeated
simple random samples of the same size
from a population. Since each sample is random, the mean value will
vary from sample to sample in a way that cannot
be predicted with certainty. Right now, this may
seem a bit abstract. To get a better idea of the sampling distribution
of the mean, let’s continue with
our penguin example. Imagine you’re
setting a population of 10,000 blue penguins, which are the smallest of
all known penguin species. You want to find out
the mean weight of a blue penguin in
this population. Since it would take
too long to locate and weigh every single penguin, you instead collect sample
data from the population. Let’s say you take
repeated simple random samples of 10 penguins, each from the population. In other words, you randomly choose 10 penguins
from the group, weigh them, and then repeat this process with a different
set of 10 penguins. For your first sample, you find the mean weight of the 10 penguins is 3.1 pounds. For your second sample, the mean weight of the 10
penguins is 2.9 pounds. For your third sample, the mean weight is 2.8
pounds, and so on. Imagine that the
true mean weight of a penguin in this
population is three pounds. Although in practice,
you wouldn’t know this unless you weighed
every single penguin. Each time you take a
sample of 10 penguins, it’s likely that the mean
weight of the penguins in your sample will be close to the population mean
of three pounds, but not exactly 3 pounds. Every once in a while, you
may get a sample full of smaller than average
penguins with a mean weight of
2.5 pounds or less. Or you might get a simple
full of larger than average penguins
with a mean weight of 3.5 pounds or more. The mean weight will vary
randomly from sample to sample. Sampling variability
refers to how much an estimate varies
between samples. You can use a sampling
distribution to represent the frequency of all your
different sample means. I find that it helps to visualize these samples
as a histogram. Let’s plot 10 simple random
samples of 10 penguins each. The most frequently
occurring sample means will be around three pounds. The least frequent
sample means will be the more extreme ways such
as 2.3 or 3.7 pounds. As you increase the
size of a sample, the mean weight of your
sample data will get closer to the mean weight
of the population. If you sample the entire
population, in other words, if you actually weighed
all 10,000 penguins, your sample mean would be the same as your population mean. But to get an accurate estimate
of the population mean, you don’t have to
weigh 10,000 penguins. If you take a large enough
sample size from a population, say 100 penguins, your sample mean will be an accurate estimate of
the population mean. This point is based on the
central limit theorem, which we’ll explore in more detail later
on in the course. For now, just know that if
your sample is large enough, your sample mean will roughly
equal the population mean. For instance, imagine you collect a sample of 100 penguins and find that the mean weight
of your sample is 3 pounds. This means that your best
estimate for the mean weight of the entire penguin
population is also 3 pounds. You can also use
your sampling in it to estimate how accurately the mean weight of
any given sample represents the
population mean weight. This is useful to know because the mean varies from
sample to sample in any given sample is not necessarily an exact reflection
of the population mean. For example, the
true mean weight for the penguin population
might be three pounds. The mean weight for
any given sample of penguins might be 3.3 pounds, 2.8 pounds, 2.4
pounds, and so on. The more variability
in your sample data, the less likely it is
that the sample mean is an accurate estimate
of the population mean. Data professionals use
the standard deviation of the sample means to
measure this variability. Recall that the standard deviation measures
the variability of your data or how spread
out your data values are. The more spread between
the data values, the larger the
standard deviation. In statistics, the
standard deviation of a sample statistic is
called the standard error. The standard error
of the mean measures variability among all
your sample means. A larger standard
error indicates that the sample means
are more spread out, whether there’s
more variability. A smaller standard
error indicates that the sample means
are closer together, or that there’s
less variability. The less standard error, the more likely it is
that your sample mean is an accurate estimate
of the population mean. For example, say you take three random samples
of 10 penguins each, the mean weight of the
first sample is 3.3 pounds. The second is 3.1 pounds, and the third is 2.9 pounds. There’s not much variability among these three sample means. The values are all
close together. The standard error will
be relatively small. Now, say you take another three random samples
of 10 penguins each. The mean weight of the
first sample is 2.2 pounds, the second is 3.2 pounds, and the third is 4.2 pounds. There’s more variability among
these three sample means. The values are more spread out. The standard error will
be relatively large. Note that the concept
of standard error is based on the practice
of repeated sampling. In reality, researchers usually work with a single sample. It’s often too
complicated, expensive, or time-consuming to take repeated samples
of a population. Instead, statisticians
have derived a formula for calculating the standard error based on the mathematical assumption
of repeated sampling. You can use the
following formula to calculate the standard
error of the sample mean. S divided by the
square root of n, where S is the sample
standard deviation and n is the sample size. For example, in your
study of penguin weights, imagine that a sample of 100
penguins has a mean weight of three pounds and a standard
deviation of one pound. You can calculate the
standard error by dividing the sample standard deviation
1 by the square root of the sample size 100.1 divided by the square root of
100 equals 0.1. This means that your
best estimate for the true population mean weight of all penguins is 3 pounds, but you should expect that the mean weight from
one sample to the next will vary with a standard
deviation of about 0.1 pounds. As your sample size gets larger, your standard error
gets smaller. This is because standard
error measures the difference between your sample mean and
the actual population mean. As your sample gets larger, your sample mean gets closer to the actual population mean. The more accurate the estimate
of the population mean, the smaller the standard error. Say you collect a
sample of 10,000 penguins instead
of 100 penguins. You find that the
sample mean weight is 3 pounds and the sample
standard deviation is 1 pound. The standard error is one
divided by the square root of 10,000, which equals 0.01. Your best estimate
for the sample mean will still be 3 pounds, but now you can expect
that the mean weight from one sample of
penguins to the next will vary with a standard
deviation of just 0.01 pounds. In general, you can have more confidence in
your estimates as the sample size gets larger and the standard
error gets smaller. This is because the mean of your sampling distribution gets closer to the population mean. Coming up, we’ll
explore this idea further when we talk about
the central limit theorem.

Video: The central limit theorem

Notes

Quiz

Transcript

What is the Central Limit Theorem (CLT)?

Key Idea: As the sample size increases, the distribution of sample means will approach a normal distribution (bell curve), regardless of the shape of the original population distribution.
Practical Benefit: If your sample is large enough, the sample mean will be a close approximation of the true population mean.

Why the CLT Matters

No Need to Know Population Distribution: You can estimate population parameters (like the mean) without needing detailed knowledge of the entire population’s distribution.
Sample Size Guidelines: While there’s no single rule, samples of 30 or more are generally considered sufficient for the CLT to apply.
Real-World Applications: The CLT is used in economics, science, business, and more to understand things like average income, animal populations, and commute times.

Examples

Household Income: Even if the distribution of income is skewed, a large enough sample will give a sampling distribution that’s normal and provides a good estimate of average income.
Coffee Consumption: You can estimate the average coffee consumption across a large population by studying representative samples and applying the CLT.

Key Takeaway The Central Limit Theorem is a powerful tool that helps data professionals make inferences about populations based on sample data.

Fill in the blank: The central limit theorem states that the sampling distribution of the mean approaches a _____ distribution as the sample size increases

Normal

The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. In other words, as the sample size increases, the sampling distribution assumes the shape of a bell curve. If a large enough sample of the population is used, the sample mean will be roughly equal to the population mean.

Recently, we talked briefly about the central limit theorem and the relationship between sample size and the sample mean. Data professionals use the
central limit theorem to estimate population parameters
for data in economics, science, business,
and other fields. For example, they may use
the theorem to estimate the mean annual household income for an entire city or country, the mean height and weight for an entire animal or
plant population, or the mean commute time for all the employees of
large corporation. In this video, you’ll learn more about the central
limit theorem and how it can help you estimate the population mean for
different types of data. The central limit theorem states that the sampling
distribution of the mean approaches a
normal distribution as the sample size increases. In other words, as
your sample increases, your sampling
distribution assumes the shape of a bell curve. If you take a large enough
sample of the population, the sample mean will be roughly equal to the
population mean. For example, say you
want to estimate the average height for a university student
in South Africa. Instead of measuring
millions of students, you can get data on a
representative sample of students. If your sample size
is large enough, the mean height of
your sample will be roughly equal to the mean
height of the population. There is no exact rule for how large a sample
size needs to be in order for the central
limit theorem to apply. In general, a sample size of 30 or more is
considered sufficient. Exploratory data analysis
can help you determine how large of a sample is
necessary for a given dataset. What’s really powerful about the central limit
theorem is that it holds true for
any population. You don’t need to
know the shape of your population distribution in advance to apply the theorem. If you collect a
large enough sample, the shape of your
sampling distribution will follow a normal
distribution. This pattern is true even if your population has a
skewed distribution. For example, here is a graph for annual household income in
the US for the year 2010. The x-axis represents
annual income and the y-axis represents
the percent of households that
have that income. You may notice how
the data is skewed to the right and the shape of the distribution
is far from normal. The distribution for annual
income is skewed because of the extraordinarily high incomes of the wealthiest households. However, if you sample
it comes at random among all households and
take a large enough sample, your sampling distribution will follow a normal distribution. This is true even though the
population distribution, every US household
is not normal. The mean income of your sampling distribution will give you an accurate estimate of the mean income of the
entire population. Let’s check out another example. Imagine you’re studying
the population of coffee drinkers in
the United States. You want to know the
average amount of coffee each person
drinks per day, but you don’t have
the time or money to survey every single
coffee drinker in the US, which, by the way, is around 150 million people. Instead of surveying
the entire population, you collect repeated
random samples up to 100 coffee drinkers. Using this data, you calculate the mean amount of
coffee consumed per day for your first
sample, 22.5 ounces. For your second sample, the mean amount is 28.2 ounces. You take a third sample, the mean amount is
25.4 ounces and so on. In theory, you could take 10, 50 or 100 samples and keep
increasing the sample size until you’ve surveyed
all 150 million people about their coffee consumption. The central limit theorem says that as your
sample size increases, the shape of your sampling
distribution will increasingly resemble
a bell curve. In practice, this
specific sample size you choose will depend
on factors like budget, time, resources, and the desire level of
confidence for your estimate. If you take a large enough
sample from the population, the mean of your
sampling distribution will equal the population mean. From this sample
of the population, you can accurately estimate
the average amount of coffee consumed per day for
the entire population. In case you’re wondering, the average American
drinks around 24 ounces or three cups
of coffee per day. Based on what I’ve noticed, if we took a sample of
only data professionals, the mean value might
be even higher. Whether you’re measuring
coffee consumption or household income, the central limit theorem
is a useful method for better understanding the
distribution of your data.

Reading: Infer population parameters with the central limit theorem

Reading

Recently, you learned about the central limit theorem and how it can help you work with a wide variety of datasets. Data professionals use the central limit theorem to estimate population parameters for data in economics, science, business, and many other fields.

In this reading, you’ll learn more about the central limit theorem and how it can help you estimate the population mean for different types of data. We’ll go over the definition of the theorem, the conditions that must be met to apply the theorem, and check out an example of the theorem in action.

The central limit theorem

Definition

The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. In other words, as your sample size increases, your sampling distribution assumes the shape of a bell curve. And, as you sample more observations from a population, the sample mean gets closer to the population mean. If you take a large enough sample of the population, the sample mean will be roughly equal to the population mean.

For example, imagine you want to estimate the average weight of a certain class of vehicle, like light-duty pickup trucks. Instead of weighing millions of pickup trucks, you can get data on a representative sample of pickup trucks. If your sample size is large enough, the mean weight of your sample will be roughly equal to the mean weight of the population (adhering to the law of large numbers).

Note: The central limit theorem holds true for any population. You don’t need to know the shape of your population distribution in advance to apply the theorem—the distribution could be bell-shaped, skewed, or have another shape. If you collect enough samples of sufficient size, the shape of the distribution of their means will follow a normal distribution.

Conditions

In order to apply the central limit theorem, the following conditions must be met:

Randomization: Your sample data must be the result of random selection. Random selection means that every member in the population has an equal chance of being chosen for the sample.
Independence: Your sample values must be independent of each other. Independence means that the value of one observation does not affect the value of another observation. Typically, if you know that the individuals or items in your dataset were selected randomly, you can also assume independence.
- 10%: To help ensure that the condition of independence is met, your sample size should be no larger than 10% of the total population when the sample is drawn without replacement (which is usually the case).
  - Note: In general, you can sample with or without replacement. When a population element can be selected only one time, you are sampling without replacement. When a population element can be selected more than one time, you are sampling with replacement. You’ll learn more about this topic later on in the course.
Sample size: The sample size needs to be sufficiently large.

Let’s discuss the sample size condition in more detail. There is no exact rule for how large a sample size needs to be in order for the central limit theorem to apply. The answer depends on the following factors:

Requirements for precision. The larger the sample size, the more closely your sampling distribution will resemble a normal distribution, and the more precise your estimate of the population mean will be.
The shape of the population. If your population distribution is roughly bell-shaped and already resembles a normal distribution, the sampling distribution of the sample mean will be close to a normal distribution even with a small sample size.

In general, many statisticians and data professionals consider a sample size of 30 to be sufficient when the population distribution is roughly bell-shaped, or approximately normal. However, if the original population is not normal—for example, if it’s extremely skewed or has lots of outliers—data professionals often prefer the sample size to be a bit larger. Exploratory data analysis can help you determine how large of a sample is necessary for a given dataset.

Example: Annual salary

Let’s explore an example to get a better idea of how the central limit theorem works.

Imagine you’re studying annual salary data for working professionals in a large city like Buenos Aires, Cairo, Delhi, or Seoul. Let’s say the professional population you’re interested in includes 10 million people. You want to know the average annual salary for a professional living in the city. However, you don’t have the time or money to survey millions of professionals to get complete data on every salary.

Instead of surveying the entire population, you collect survey data from repeated random samples of 100 professionals. Using this data, you calculate the mean annual salary in dollars for your first sample: $40,300. For your second sample, the mean salary is: $41,100. You survey a third sample. The mean salary is $39,700. And so on. Due to sampling variability, the mean of each sample will be slightly different.

In theory, you could take a very large sample and increase the sample size until you’ve surveyed all 10 million people about their salary. The central limit theorem says that as your sample size increases, the shape of your sampling distribution will increasingly resemble a bell curve.

If you take a large enough sample from the population, the mean of your sampling distribution will be roughly equal to the population mean. From this sample of the population, you can precisely estimate the average annual salary for the entire professional population.

Note: In practice, data professionals usually take a single sample. The specific sample size they choose depends on factors like budget, time, resources, and the desired level of confidence for their estimate.

Key takeaways

The central limit theorem can help you infer population parameters like the mean even if you only have available data on a portion of the population. The larger your sample size, the more precise your estimate of the population mean is likely to be. Whether you’re measuring vehicle weight or annual salary, the central limit theorem is a useful method for better understanding your data.

Video: The sampling distribution of the proportion

Notes

Transcript

What is a Population Proportion?

It’s the percentage of individuals in a population with a specific characteristic (e.g., percentage of teens preferring slip-on sneakers).

Why Sample?

Surveying an entire population is often impractical. We use samples to estimate the true population proportion.

Sampling Distribution of the Proportion

Just like with sample means, sample proportions vary between samples.
A sampling distribution shows the frequency of different possible sample proportions.
As sample size increases, the distribution becomes approximately normal (due to the Central Limit Theorem).

Estimating the Population Proportion

If your sample is large enough, the sample proportion is a good estimate of the true population proportion.

Standard Error of the Proportion

Measures how much a sample proportion is likely to differ from the true population proportion.
Formula: √(p-hat * (1 – p-hat) / n) where p-hat is the sample proportion and n is the sample size.
Larger sample size = smaller standard error = more accurate estimate.

Key Takeaways

Sampling distributions help us understand the variability of sample proportions and how they relate to the true population proportion.
Data professionals use this knowledge to provide robust estimates and accurate information for decision-making.

In this part of the course, we’ve been
talking about how data professionals use sample statistics to
estimate population parameters. Recently, you learned how to use
the sampling distribution of the mean to estimate the actual population mean. For example, you might estimate the mean
weight of an animal population or the mean salary of all the people who
work in the hospitality industry. Data professionals also use sampling
distributions to estimate population proportions. In statistics, a population proportion
refers to the percentage of individuals or elements in a population that
share a certain characteristic. Proportions measure percentages or
parts of a whole. For example, you might survey 100
employees at a large company to estimate what percentage of all employees like
the food at the office cafeteria. Data professionals might also use the
sampling distribution of the proportion to estimate the proportion of all visitors
to a website who make a purchase before leaving. Assembly line products that meet
quality control standards, or voters who support a candidate
in an upcoming election. In this video, you’ll learn about
the sampling distribution of the sample proportion and how it can help you
estimate the population proportion. Imagine you work for
a market research firm, your client is a company that
manufactures sneakers and wants to make sure their sneakers
appeal to the largest audience. You’re asked to research sneaker
preferences among residents of Santiago, Chile, who are between 16 and
19 years old. There are 100,000 teenagers
in that age group. You might want to find out what proportion
of this population prefers slip on sneakers over sneakers with shoelaces. Since it would take too long to locate and
survey all 100,000 teens, you instead collect sample
data from the population. Let’s say you take repeated
simple random samples of 100 teenagers from the overall population. In your first sample, you find that 12%
of teenagers prefer slip on sneakers. In your second sample,
you find that 8% prefer slip on sneakers. You take a third sample,
and the proportion is 11%. Earlier, we talked about sampling
variability for the sample means or how the value of the mean varies
from one sample to the next. The same holds true for proportions. Let’s assume we know that 10% of teenagers
in the total population prefer slip-on sneakers. In most of the samples, the proportion
of teenagers who prefer slip-on sneakers will be close to the true population
proportion of 10%, but not exactly 10%. Occasionally, a sample may turn out to
have a proportion that’s very small or very large. You can use a sampling distribution to
represent the frequency of all your different sample proportions. For instance, if you take ten simple random samples
of 100 teenagers from this population, you can show the sampling distribution
of the proportion in a histogram. The most frequently occurring values
in your sample data will be around 10%. The values that occur least frequently
will be the more extreme proportions, such as 5% or 15%. As with the sample means, the central limit theorem also
applies to sample proportions. As your sample size increases, the distribution of the sample
proportion will be approximately normal. The overall average, or mean proportion,
is located in the center of the curve. If you take a sufficiently large
enough sample of teenagers, the sample proportion will be an accurate estimate
of the true population proportion. If you survey 1,000 teenagers and find that 10% prefer slip on sneakers,
this means that your best estimate for the proportion of all teenagers
who prefer slip-ons is also 10%. As with the sample mean, you can use
the standard error of the proportion to measure sampling variability. This tells you how much a particular
sample proportion is likely to differ from the true population proportion. This is useful to know because the
proportion varies from sample to sample, and any given sample proportion probably
won’t be exactly equal to the true population proportion. The true proportion of teenagers who
prefer slip on sneakers might be 10%, but the proportion of any given sample
might be 12%, 9%, 7%, and so on. The more variability in your sample data,
the less likely it is that the sample proportion is an accurate estimate
of the population proportion. It’s important to understand the accuracy
of your estimate because stakeholder decisions are often based on
the estimates you provide. You can use the following formula
to calculate the standard error of the proportion the square root of
p hat open parentheses one minus p hat closed parentheses divided by n. P hat refers to the population proportion,
and n refers to the sample size. In statistics, you say hat when you refer
to the carrot symbol above the letter p. The formula is the square root of p
hat multiplied by one minus p hat divided by n. For example, suppose you survey 100
teenagers about their sneaker preferences and find that your estimate for the population proportion of teens who
prefer slip-on sneakers is 10% or 0.1. In this case, p hat is 0.1 and n is 100. When you plug in the numbers
into the formula for standard error of the proportion,
it = 0.03. As your sample size gets larger,
your standard error gets smaller. Because standard error measures the
difference between your sample proportion and the true population proportion. As your sample gets larger, your sample proportion gets closer
to the true population proportion. The more accurate the estimate
of the population proportion, the smaller the standard error. Your estimate will help stakeholders at
the Sneaker Company make decisions about product development. Based on your results, they may want to put less money
into developing slip-on sneakers. Typically, the next step for a data
professional would be to use the standard error to construct a confidence interval. This describes the uncertainty
of your estimate and gives your stakeholders more detailed
information about your results. Later on in this course, you’ll learn how
to calculate and interpret confidence intervals to more accurately predict
preferences of a population.

Reading: The sampling distribution of the mean

Reading

Recently, you’ve learned about how data professionals use sample statistics to estimate population parameters. For example, a data professional might estimate the mean time customers spend on a retail website, or the mean salary of all the people who work in the entertainment industry.

In this reading, you’ll learn more about the concept of sampling distribution and how it can help you represent the possible outcomes of a random sample. We’ll also discuss how the sampling distribution of the sample mean can help you estimate the population mean.

Sampling distribution of the sample mean

A sampling distribution is a probability distribution of a sample statistic. Recall that a probability distribution represents the possible outcomes of a random variable, such as a coin toss or a die roll. In the same way, a sampling distribution represents the possible outcomes for a sample statistic. Sample statistics are based on randomly sampled data, and their outcomes cannot be predicted with certainty. You can use a sampling distribution to represent statistics such as the mean, median, standard deviation, range, and more.

Typically, data professionals compute sample statistics like the mean to estimate the corresponding population parameters.

Suppose you want to estimate the mean of a population, like the mean height of a group of humans, animals, or plants. A good way to think about the concept of sampling distribution is to imagine you take repeated samples from the population, each with the same sample size, and compute the mean for each of these samples. Due to sampling variability, the sample mean will vary from sample to sample in a way that cannot be predicted with certainty. The distribution of all your sample means is essentially the sampling distribution. You can display the distribution of sample means on a histogram. Statisticians call this the sampling distribution of the mean.

Note: In practice, due to limited time and resources, data professionals typically collect a single sample and calculate the mean of that sample to estimate the population mean.

Let’s explore an example to get a more concrete idea of the sampling distribution of the mean.

Example: Mean length of lake trout

You are a data professional working with a team of environmental scientists. Your team studies the effects of water pollution on fish species. Currently, your team is researching the effects of pollution on the trout population in Lake Superior, one of the Great Lakes in North America. As part of this research, they ask you to estimate the mean length of a trout. Let’s say there are 10 million trout in the lake. Instead of collecting and measuring millions of trout, you take sample data from the population.

Let’s say you take repeated simple random samples of 100 trout each from the population. In other words, you randomly choose 100 trout from the lake, measure them, and then repeat this process with a different set of 100 trout. For your first sample of 100 trout, you find that the mean length is 20.2 inches. For your second sample, the mean length is 20.5 inches. For your third sample, the mean length is 19.7 inches. And so on. Due to sampling variability, the mean length will vary randomly from sample to sample.

For the purpose of this example, let’s assume that the true mean length of a trout in this population is 20 inches. Although, in practice, you wouldn’t know this unless you measured every single trout in the lake.

Each time you take a sample of 100 trout, it’s likely that the mean length of the trout in your sample will be close to the population mean of 20 inches, but not exactly 20 inches. Every once in a while, you may get a sample full of shorter than average trout, with a mean length of 16 inches or less. Or, you might get a sample full of longer than average trout, with a mean length of 24 inches or more.

You can use a sampling distribution to represent the frequency of all your different sample means. For example, if you take 10 simple random samples of 10 trout each from the population, you can show the sampling distribution of the mean as a histogram. The most frequently occurring value in your sample data will be around 20 inches. The values that occur least frequently will be the more extreme lengths, such as 16 inches or 24 inches.

As you increase the size of a sample, the mean length of your sample data will get closer to the mean length of the population. If you sampled the entire population—in other words, if you actually measured every single trout in the lake—your sample mean would be the same as the population mean.

However, you don’t need to measure millions of fish to get an accurate estimate of the population mean. If you take a large enough sample size from the population— – say, 1000 trout —– your sample mean will be a precise estimate of the population mean (20 inches).

Standard error

You can also use your sample data to estimate how precisely the mean length of any given sample represents the population mean.

This is useful to know because the sample mean varies from sample to sample, and any given sample mean is likely to differ from the true population mean. For example, the mean length of the trout population might be 20 inches. The mean length for any given sample of trout might be 20.2 inches, 20.5 inches, 19.7 inches, and so on.

Data professionals use the standard deviation of the sample means to measure this variability. In statistics, the standard deviation of a sample statistic is called the standard error. The standard error provides a numerical measure of sampling variability. The standard error of the mean measures variability among all your sample means. A larger standard error indicates that the sample means are more spread out, or that there’s more variability. A smaller standard error indicates that the sample means are closer together, or that there’s less variability.

In practice, using a single sample of observations, you can apply the following formula to calculate the estimated standard error of the sample mean: s / √n. In the formula, s refers to the sample standard deviation, and n refers to the sample size.

For example, in your study of trout lengths, imagine that a sample of 100 trout has a mean length of 20 inches and a standard deviation of 2 inches. You can calculate the estimated standard error by dividing the sample standard deviation, 2, by the square root of the sample size, 100:

2 ÷ √100 = 2 ÷ 10 = 0.2

This means you should expect that the mean length from one sample to the next will vary with a standard deviation of about 0.2 inches.

The standard error helps you understand the precision of your estimate. In general, you can have more confidence in your estimates as the sample size gets larger and the standard error gets smaller. This is because, as your sample size gets larger, the sample mean gets closer to the population mean.

Key takeaways

Estimating population parameters through sampling is a powerful form of statistical inference. Sampling distributions describe the uncertainty associated with a sample statistic, and help you make proper statistical inferences. This is important because stakeholder decisions are often based on the estimates you provide.

Practice Quiz: Test your knowledge: Sampling distributions

A data professional is analyzing data about a population of aspen trees. They take repeated random samples of 10 trees from the population and compute the mean height for each sample. Which of the following statements best describes the sampling distribution of the mean?

The probability distribution of all the sample means

The sampling distribution of the mean is the probability distribution of all the sample means. A probability distribution represents the possible outcomes of a random variable.

Gemini

Here’s the correct answer and explanation:

The best description of the sampling distribution of the mean is:

The probability distribution of all the sample means

Why?

Sampling Distribution: A key concept in statistics, it represents the distribution of a statistic (in this case, the sample mean) calculated from multiple samples of the same size drawn from a population.
Not just one value: It’s not a single value like the average or maximum of sample means. Instead, it shows the likelihood of obtaining different possible sample means.

Key Takeaway: The sampling distribution of the mean is a theoretical distribution that helps us understand how sample means vary and how they relate to the true population mean.

ChatGPT

The correct statement is:

The probability distribution of all the sample means.

In repeated random sampling, the sampling distribution of the mean represents the distribution of all possible sample means that could be obtained from the population. It provides information about the variability of sample means and helps estimate population parameters from sample statistics.

The central limit theorem implies which of the following statements? Select all that apply.

If you take a large enough sample of the population, the sample mean will be roughly equal to the population mean.

The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. If a large enough sample of the population is taken, the sample mean will be roughly equal to the population mean.

The sampling distribution of the mean approaches a normal distribution as the sample size increases

Gemini

Here are the correct statements regarding the Central Limit Theorem (CLT):

If you take a large enough sample of the population, the sample mean will be roughly equal to the population mean. The CLT states that as sample size increases, the average of the sample means will get closer to the true population mean.
The sampling distribution of the mean approaches a normal distribution as the sample size increases. This is a key aspect of the CLT. Regardless of the shape of the original population distribution, the distribution of sample means becomes more normal as sample size grows.

Important Note: The CLT is about the distribution of sample means, not individual samples.

ChatGPT

The correct statement is:

The probability distribution of all the sample means.

What is a standard error?

The standard deviation of a sample statistic

A standard error is the standard deviation of a sample statistic.

Gemini

Here’s the correct answer and explanation:

The standard error is: The standard deviation of a sample statistic

Standard Deviation vs. Standard Error: Both measure variability, but in different contexts:
- Standard Deviation: Measures the spread of individual data points within a sample or population.
- Standard Error: Measures the spread of sample statistics (like the mean or proportion) from multiple samples. It tells us how much a sample statistic is likely to vary from the true population parameter.

Key Takeaway: The standard error is essential for understanding the accuracy and reliability of statistical estimates. A smaller standard error means our sample statistic is more likely to be close to the true population value.

ChatGPT

The correct answer is:

The standard error of a sample statistic.

Standard error is a measure of the variability or precision of a sample statistic, such as the mean or proportion, and it quantifies how much the sample statistic varies from the population parameter it estimates. It is often calculated as the standard deviation of the sample divided by the square root of the sample size.

Work with sampling distributions in Python

Lab: Annotated follow-along guide: Sampling distributions with Python

Reading

Instructions

All of the instructional videos with onscreen coding demonstrations have a corresponding follow-along guide that is available to you. The follow-along guide is an annotated Jupyter notebook organized to match the content from each module. It contains the same code shown in the videos for that module. This guide is provided for your reference; you do not need to add any text or run the code yourself. If you would like to run the code, you will need to run each cell sequentially for the code to function as intended.

In addition to content that is identical to what is covered in the videos, you’ll often find additional information throughout the guide to explain the purpose of each concept covered, why the code is written in a certain way, and tips for running the code.

The landing page for each follow-along notebook also provides information about data sources (when relevant) and tips on how to access and use these guides.

Data dictionary

This notebook uses a file called education_districtwise.csv. This dataset represents a list of school districts in an anonymous country. The data includes district and state names, total population, and the literacy rate.

The dataset contains:

634 rows – each row is a different school district

7 columns

Column name	Type	Description
DISTNAME	str	The names of an anonymous country’s school districts
STATNAME	str	The names of an anonymous country’s states where school districts are located
BLOCKS	int64	The number of blocks in the school district. Blocks are the smallest organizational structure in the education system of the anonymous country.
VILLAGES	int64	column shows how many villages are in each district
CLUSTERS	int64	The number of clusters in the school district. Clusters are the second smallest organizational structure in the education system of the anonymous country.
TOTPOPULAT	int64	column shows the population for each district
OVERALL_LI	int64	shows the literacy rate for each district

PDF

Annotated-follow-along-guide_-Sampling-with-Python Download

Video: Sampling distributions with Python

Notes

Transcript

Understanding Population Parameters through Sampling

Point Estimates: Data professionals often want to know characteristics (like the average literacy rate) of an entire population. Directly measuring everyone is often impractical. Instead, we take a random sample and calculate the sample mean as an estimate (point estimate) of the true population mean.
Sampling Variability: Due to chance, the sample mean will likely differ slightly from the population mean. Different random samples will produce different point estimates.
Central Limit Theorem
- Larger sample sizes lead to more accurate estimates.
- The distribution of sample means from many random samples approaches a normal distribution centered around the true population mean.

Simulating Sampling with Python

Example: The text demonstrates how to use Python code (sample(), mean(), etc.) to simulate taking random samples of districts and calculating sample means for literacy rate.
10,000 Samples: Simulating taking 10,000 random samples shows:
- The distribution of these sample means is approximately normal.
- The average of these sample means closely aligns with the true population mean.

Key Takeaways

Sampling allows us to estimate population characteristics even if we can’t measure the whole population.
Python provides tools to simulate and visualize this process.
Understanding the Central Limit Theorem helps us interpret the accuracy and variability of our estimates.

Earlier, we talked about
how data professionals use sample data to make point estimates about
population parameters. For instance, if
you want to know the average age of
registered voters in Japan, you could take a survey
of 100 registered voters. Then you could use
the average age of the survey respondents, as a point estimate of the average age of all
registered voters. If your sample size
is large enough, your sample mean will give you a pretty good estimate
of the population mean. In this video, you’ll use Python to simulate
random sampling. Then based on your sample data, you’ll make a point estimate
of a population mean. We’ll continue with our previous scenario in which you’re a data professional working for the Department of Education
of a large nation. Recall that you’re
analyzing data on the literacy rate
for each district. You’ll continue to use the data set you
worked with before. If you need to access
the data, do so now. For this video, we’ll make a
slight change to our story. Imagine that you are asked
to collect the data on district literacy rates and that you have limited
time to do so. You can only survey 50
randomly chosen districts, instead of 634 districts included in your
original data set. The goal of your research
study is to estimate the mean literacy rate
for all 634 districts, based on your sample
of 50 districts. You can use Python
to simulate taking a random sample of 50
districts from your data set. Now, let’s open up a Jupyter
notebook and get to work. To start, import the Python packages you
will use numpy, pandas, and statsmodels.api, and the library you will
use matplotlib.pyplot. To save time,
rename each package in library with an
abbreviation np, pd, plt, and sm. To load the scipy
stats module write from scipy import stats. First, you’ll want to get a random sample of 50 districts. A cool feature of Python is
that you can use code to simulate random sampling and choose the desired sample size. To do this, use the sample()
function in pandas. Before you write the code, let’s review the function
and its arguments. To simulate random sampling, use the following arguments
in the sample function n, replace, and random_state. n refers to the
desired sample size, replace indicates
whether you are sampling with or
without replacement, random_state refers to the
seed of the random number. Let’s explore each
argument in more detail. First, sample size or the
number or items in your sample. In this case, you want to
take a random sample of 50 district literacy rates from the overall literacy column. Second, replacement. In general, you can sample
with or without replacement. When a population element can be selected more than one time, you were sampling
with replacement. When a population element can
be selected only one time, you were sampling
without replacement. For example, suppose we
have a jar that contains 100 unique numbers from 1-100. You want to select
a random sample of numbers from the jar. After you pick a
number from the jar, you can put the number aside or you can put it
back in the jar. If you put the number
back in the jar, it may be selected
more than once. This is sampling
with replacement. If you put the number aside, it can be selected
only one time. This is sampling
without replacement. For the purposes of our example, you will sample
with replacement. The final part of the code is random_state or the seed
of the random number. A random seed is
a starting point for generating random numbers. You can use any
arbitrary number to fix the random seed and give the random number generator
a starting point. Also, going forward you can use the same random seed to generate
the same set of numbers. In a later video, you’ll
work with the sample again. Now you’re ready to
write your code. First, name a new
variable sampled_data. Then set the arguments
for the sample function. n sample size equals 50, replace equals true because you’re sampling
with replacement, for random_state, choose an arbitrary number
for your random seed. How about 31,208? Now, display the value
of your variable. The output shows 50 districts selected randomly
from your data set. Each has a different
literacy rate. Now that you have
your random sample, use the mean function to
compute the sample mean. First, name a new
variable, estimate1. Next, use the mean function to compute the mean for
your sample data. The sample mean for
district literacy rate is about 74.22 percent. This is a point estimate
of the population mean based on your random
sample of 50 districts. Remember that the
population mean is the literacy rate
for all districts. Due to sampling variability, the sample mean is usually not exactly the same as
the population mean. Next, let’s find out what will happen if you
compute the sample mean based on another random
sample of 50 districts. To generate another
random sample, name a new variable estimate 2. Then set the arguments
for the sample function. Once again, n is 50, and replace is true. This time, choose a
different number for your random state to
generate different sample. How about 56,810? Finally, add the mean function
at the end of your line of code to
compute the sample mean. Display the value
of your variable. For your second estimate, the sample mean for your
district literacy rate is about 74.24 percent. Due to sampling variability, this sample mean is different from the sample mean of
your previous estimate, 74.22 percent but
they’re really close. Recall that the
central limit theorem tells you that when the
sample size is large enough, the sample mean approaches a normal distribution and as you sample more observations
from a population, the sample mean gets closer
to the population mean. The larger your sample size, the more accurate
your estimate of the population mean
is likely to be. In this case, the
population mean is the overall literacy rate for all districts in the nation. In a previous video, you found that the
population mean literacy rate is 73.39 percent. Based on sampling, your
first estimated sample mean was 74.22 percent, and your second estimate
was 74.24 percent. Each estimate is relatively
close to the population mean. Now, imagine you
repeated this study 10,000 times and obtain 10,000 point estimates
of the mean. In other words, you take
10,000 random samples of 50 districts and compute
the mean for each sample. According to the
central limit theorem, the mean of your
sampling distribution will be roughly equal
to the population mean. You can use Python to
compute the mean of the sampling distribution
with 10,000 samples. Let’s go over the
code step by step. First, create an empty list to store the sample mean
from each sample. Name this estimate_list. Second, set up a for loop
with the range function. The loop will run 10,000 times and iterate over each
number in the sequence. Third, specify what you want to do in each
iteration of the loop. The sample function tells
the computer to take a random sample of 50
districts with replacement. The argument n equals 50 and the argument
replace equals true. The append function adds a single item to
an existing list. In this case, it
appends the value of the sample mean to
each item in the list. Your code generates a
list of 10,000 values, each of which is the sample
mean from a random sample. Next, create a new data frame for your list of
10,000 estimates. Name a new variable estimate_df
to store your data frame. Now, name a new variable
mean_sample_means. Then compute the mean for your sampling distribution
of 10,000 random samples. Display the value
of your variable. The mean of your
sampling distribution is about 73.41 percent. This is essentially identical to the population mean of
your complete dataset, which is about 73.4 percent. To visualize the
relationship between your sampling
distribution of 10,000 estimates and the normal distribution, we can plot both
at the same time. For now, don’t worry about the code, as it’s beyond
the scope of this course. I want to share three
takeaways from this graph. First, as the central
limit theorem predicts, the histogram of the
sampling distribution is well approximated by the
normal distribution. The outline of the histogram closely follows
the normal curve. Second, the mean of the
sampling distribution, the blue dotted line, overlaps with the
population mean, the green solid line. This shows that
the two means are essentially equal to each other. Third, the sample mean of your first estimate
of 50 districts, the red dashed line is
farther away from the center. This is due to
sampling variability. The central limit theorem shows that as you increase
the sample size, your estimate becomes
more accurate. For a large enough sample, the sample mean closely
follows a normal distribution. Your first sample of
50 districts estimated the mean district literacy
rate is 74.22 percent, which is relatively close to the population mean
of 73.4 percent. To ensure your estimate will
be useful to the government, you can compare the
nation’s literacy rate to other benchmarks, such as the global literacy rate or the literacy rate
of peer nations. If the nation’s literacy rate
is below these benchmarks, this may help convince
the government to devote more resources to improving
literacy across the country. Estimating population
parameters through sampling is a powerful form
of statistical inference. When you’re dealing
with large numbers and complex calculations, Python helps you quickly
make accurate estimates.

Lab: Activity: Explore sampling

Reading

Data Dictionary

This activity uses air quality index (AQI) data from the United States Environmental Protection Agency (EPA). The data was taken from the EPA’s own repository of AQI data and assembled for pedagogical purposes.

This activity uses a dataset called c4_epa_air_quality.csv. The variables in the dataset have been adjusted to suit the objectives of this lab, and may vary from other versions of the dataset you have worked with previously. It contains air quality readings for various counties in the United States. For more information about AQI data, refer to AirNow.

The dataset contains:

260 rows – each row is a different aqi reading

10 columns

Column name	Type	Description
[None]	int	index
date_local	datetime	Date of AQI reading
state_name	str	The name of the state
county_name	str	The name of the U.S. county
city_name	str	U.S. City where monitoring occurred
local_site name	str	The identifier of the site in the owning agency’s (e.g., not US EPA) nomenclature
parameter_name	str	The description assigned in the air quality system to the parameter measured by the monitor. Parameters may be pollutants or non-pollutants (e.g., wind speed).
units_of_measure	str	The unit of measure for all data in that row. Every parameter has a standard unit of measure.
aritmetic_mean	int	The measure of central tendency obtained from the sum of the observed pollutant data values in the quarterly data set divided by the number of values that comprise the sum for the quarterly data set. For criteria pollutants, the sum of values only adds the values with the appropriate flagging and concurrence for the exceptional data type.
aqi	int	“Air Quality Index” – AQI is like a thermometer that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern.

PDF

Activity_Explore-sampling Download

Lab: Exemplar: Explore sampling

Reading

Exemplar_Explore-sampling Download

Practice Quiz: Test your knowledge: Work with sampling distributions in Python

Which Python function can be used to simulate random sampling?

pandas.DataFrame.sample()

The sample() function can be used to simulate random sampling.

Gemini

The correct answer is pandas.DataFrame.sample(). Here’s a breakdown of why the other options aren’t the best fit:

pandas.DataFrame.describe(): This function provides summary statistics (e.g., count, mean, standard deviation, min, max) for numerical columns in a DataFrame. It’s not used for random sampling.
pandas.DataFrame.hist(): This function creates a histogram plot, which visualizes the distribution of values in a column. It doesn’t directly perform random sampling.
pandas.DataFrame.mean(): This calculates the average (mean) of a numerical column. It doesn’t involve random selection.

How to use pandas.DataFrame.sample()

Here’s a basic example:

Python

import pandas as pd

# Sample data
df = pd.DataFrame({'data':[1,2,3,4,5]})

# Get a random sample of 2 rows
result = df.sample(2) 

print(result)

ChatGPT

The Python function that can be used to simulate random sampling is:

pandas.DataFrame.sample(): This function is used to randomly sample rows or columns from a DataFrame. It allows you to specify the number of samples or the fraction of rows/columns to be returned.

Which of the following statements describe a random seed when specifying random_state in pandas.DataFrame.sample()? Select all that apply.

Any non-negative integer can be chosen to fix the random seed.
The same random seed may be used over again to generate the same set of numbers.
A random seed is a starting point for generating random numbers.

A random seed is a starting point for generating random numbers. Any number can be chosen to fix the random seed, and the same random seed can be used over again to generate the same set of numbers.

Gemini

Here’s the breakdown of the correct statements about random seeds:

The same random seed may be used over again to generate the same set of numbers. This is true! Random seeds ensure reproducibility of results.
A random seed is a starting point for generating random numbers. Absolutely correct! The seed initializes the random number generator.
Any non-negative integer can be chosen to fix the random seed. Yes! While specific libraries might have additional restrictions, in general, any non-negative integer works.

Incorrect Statement:

Only a negative number may be chosen to fix the random seed. This is not true. Random seeds are typically non-negative integers.

Let me know if you’d like a more technical explanation of how random number generators work with seeds!

ChatGPT

The following statements describe a random seed when specifying random_state in pandas.DataFrame.sample():

The same random seed may be used over again to generate the same set of numbers.
A random seed is a starting point for generating random numbers.

These statements accurately describe the concept of a random seed. It provides a starting point for the random number generation process, and using the same seed value ensures reproducibility of results. However, the statement “Only a negative number may be chosen to fix the random seed” is incorrect. In Python, any non-negative integer can be chosen to fix the random seed.

Review: Sampling

Video: Wrap-up

Notes

Transcript

Why Sampling Matters

Practicality: Working with entire populations is often impossible due to cost, time, or sheer data size. Sampling makes analysis feasible.
Large Datasets: Modern data analytics often involves massive datasets, necessitating well-chosen samples for efficient analysis.

Key Concepts

Sampling Process: Understand the steps involved, from defining your target population to collecting the sample data.
Sampling Methods:
- Probability sampling: Ensures representativeness using random selection.
- Non-probability sampling: Prone to bias due to non-random selection.
Bias: Be aware of how different sampling methods can introduce bias, which can distort your results and insights.
Sampling Distributions: Understand how to work with distributions of sample means and proportions to estimate population parameters.
Central Limit Theorem: This powerful theorem allows estimation of the population mean even with non-normally distributed datasets.
Python Tools: Know how to use SciPy for sampling distribution calculations and population parameter estimations.

Important Reminders

Always be critical about the origins of your sample data to ensure the validity of your analysis.
Representativeness is key! A biased sample won’t accurately reflect the population you’re trying to study.

You now have a solid
foundation in sampling, which will serve you well in your future role as
a data professional. In the data career space, you’ll be working with
sample data all the time. Throughout this
part of the course, we’ve explored how
data professionals use sample data to
make inferences, predictions, and estimates
about populations. Sampling is so useful, because it’s often
too expensive, time-consuming,
or complicated to collect data for an
entire population. Sometimes, a complete
dataset may be too large to process
even for a computer. Effective sampling is especially important in modern
data analytics, because data professionals often manage extremely large datasets. For example, you might work with economic data that has tens
of millions of data points, and need to use a
sample of 10,000. As a working data professional, it’s important to understand the sampling process used to
generate your sample data, and whether or not
your sample is representative of
your population. Plus, as you now know, different types of bias affect different
sampling methods. Early on, we reviewed
the different stages of the sampling process
from choosing a target population to
collecting data for your sample. Then we discussed the two main
types of sampling methods, probability sampling, and
non-probability sampling. We went over the benefits, and drawbacks of each method, and how random sampling can help ensure that your sample
is high-quality, and representative
of your population. We also discussed different
forms of bias in sampling, and how bias affects
non-probability sampling methods. You’ll learn that any
insights you draw from bias data may not be accurate, or useful to your stakeholders. After that, you learned about sampling distributions
for both sample means, and proportions,
and how to estimate the corresponding
population parameters. We also covered the
central limit theorem, and how it helps you
estimate the population mean for many different
types of datasets. Finally, you learned how to use Python SciPy stats module to work with sampling
distributions, and make a point estimate
of a population mean. Coming up, you’ll take
a graded assessment. To prepare, check out the reading that lists all
the new terms you’ve learned. Feel free to revisit videos, readings, and other resources
that cover key concepts. Congratulations on your
progress. Let’s keep it going.

Reading: Glossary terms from module 3

Terms and definitions from Course 4, Module 3

Central Limit Theorem: The idea that the sampling distribution of the mean approaches a normal distribution as the sample size increases

Cluster random sample: A probability sampling method that divides a population into clusters, randomly selects certain clusters, and includes all members from the chosen clusters in the sample

Convenience sample: A non-probability sampling method that involves choosing members of a population that are easy to contact or reach

Descriptive statistics: A type of statistics that summarizes the main features of a dataset

Inferential statistics: A type of statistics that uses sample data to draw conclusions about a larger population

Non-probability sampling: A sampling method that is based on convenience or the personal preferences of the researcher, rather than random selection

Nonresponse bias: Refers to when certain groups of people are less likely to provide responses

Point estimate: A calculation that uses a single value to estimate a population parameter

Population: Every possible element that someone is interested in measuring

Population proportion: The percentage of individuals or elements in a population that share a certain characteristic

Probability sampling: A sampling method that uses random selection to generate a sample

Purposive sample: A non-probability sampling method that involves researchers selecting participants based on the purpose of their study

Random seed: A starting point for generating random numbers

Representative sample: A sample that accurately reflects the characteristics of a population

Sample: A subset of a population

Sample size: The number of individuals or items chosen for a study or experiment

Sampling: The process of selecting a subset of data from a population

Sampling bias: Refers to when a sample is not representative of the population as a whole

Sampling distribution: A probability distribution of a sample statistic

Sampling frame: A list of all the items in a target population

Sampling variability: Refers to how much an estimate varies between samples

Sampling with replacement: Refers to when a population element can be selected more than one time

Sampling without replacement: Refers to when a population element can be selected only one time

Simple random sample: A probability sampling method in which every member of a population is selected randomly and has an equal chance of being chosen

Snowball sample: A method of non-probability sampling that involves researchers recruiting initial participants to be in a study and then asking them to recruit other people to participate in the study

Standard error: The standard deviation of a sample statistic

Standard error of the mean: The sample standard deviation divided by the square root of the sample size

Stratified random sample: A probability sampling method that divides a population into groups and randomly selects some members from each group to be in the sample

Systematic random sample: A probability sampling method that puts every member of a population into an ordered sequence, chooses a random starting point in the sequence, and selects members for the sample at regular intervals

Target population: The complete set of elements that someone is interested in knowing more about

Undercoverage bias: Refers to when some members of a population are inadequately represented in a sample

Voluntary response sample: A method of non-probability sampling that consists of members of a population who volunteer to participate in a study

Quiz: Module 3 challenge

Which of the following statements accurately describe a representative sample? Select all that apply.

A representative sample reflects the characteristics of the overall population.
A representative sample helps data professionals make reliable inferences based on sample data.

Gemini

A representative sample reflects the characteristics of the overall population. Absolutely! A key goal is for the sample to accurately mirror the demographics, proportions, and other important traits of the population you’re studying.
A representative sample helps data professionals make reliable inferences based on sample data. Definitely! With a representative sample, you can extrapolate findings to the broader population with more confidence.

Incorrect Statements

A representative sample represents some groups in the population but not others. This describes a biased sample, not a representative one.
A representative sample suffers from sampling bias. False! The goal of a representative sample is to minimize sampling bias.

ChatGPT

The following statements accurately describe a representative sample:

A representative sample reflects the characteristics of the overall population.
A representative sample helps data professionals make reliable inferences based on sample data.

A representative sample is one that accurately mirrors the key characteristics of the entire population from which it is drawn. It enables data professionals to make reliable inferences and generalizations about the population based on the sample data.

The statement “A representative sample represents some groups in the population but not others” is incorrect. A representative sample should ideally include all relevant groups within the population to accurately reflect its diversity.

The statement “A representative sample suffers from sampling bias” is also incorrect. A representative sample is by definition free from sampling bias, as it accurately represents the population without favoring any particular subgroup or characteristic. Sampling bias occurs when certain groups or characteristics are disproportionately represented or excluded in the sample, leading to inaccurate conclusions.

What stage of the sampling process refers to creating a list of all the items in the target population?

Select the sampling frame

Gemini

The correct answer is Select the sampling frame.

Here’s why:

Sampling Frame: A list or database of all the members within your target population. It’s the starting point for selecting your sample.
Other Stages:
- Choose the sampling method: Deciding on the technique (random, stratified, etc.).
- Determine the sample size: Calculating how many individuals are needed.
- Collect the sample data: The actual process of gathering data from the selected sample.

ChatGPT

The stage of the sampling process that refers to creating a list of all the items in the target population is:

Select the sampling frame

The sampling frame is the list of all the items (individuals, elements, units, etc.) that make up the target population from which the sample will be drawn. It serves as the basis for selecting the actual sample. Creating a comprehensive and accurate sampling frame is crucial to ensure that the sample is representative of the population and that the sampling process is valid. Therefore, selecting the sampling frame is the stage of the sampling process that involves creating this list.

Which of the following statements accurately describe non-probability sampling? Select all that apply.

Non-probability sampling is often based on convenience.
Non-probability sampling is often based on the personal preferences of the researcher.
Non-probability sampling can result in biased samples.

Gemini

Non-probability sampling is often based on convenience. True! Methods like convenience sampling select easily accessible members of the population.
Non-probability sampling is often based on the personal preferences of the researcher. Also true! Judgmental sampling, for example, relies on the researcher’s knowledge or opinion to choose participants.
Non-probability sampling can result in biased samples. Absolutely! Since selection isn’t random, non-probability sampling methods are prone to bias, potentially making results unrepresentative of the population.

Incorrect Statement:

Non-probability sampling typically uses random selection. False! Random selection is a hallmark of probability sampling, not non-probability sampling.

ChatGPT

The following statements accurately describe non-probability sampling:

Non-probability sampling is often based on convenience.
Non-probability sampling can result in biased samples.

Non-probability sampling methods do not involve random selection of elements from the population, and they do not ensure that every member of the population has an equal chance of being selected. Instead, they rely on factors such as convenience, accessibility, or the judgment of the researcher. This can lead to biased samples that may not accurately represent the population. Therefore, non-probability sampling is often based on convenience and can result in biased samples.

The statement “Non-probability sampling is often based on the personal preferences of the researcher” is incorrect. While non-probability sampling methods may involve the judgment or discretion of the researcher, it is not necessarily based on their personal preferences. The focus is typically on practical considerations such as accessibility or feasibility rather than personal preferences.

The statement “Non-probability sampling typically uses random selection” is also incorrect. Non-probability sampling methods do not typically use random selection; instead, they rely on non-random methods such as convenience sampling or purposive sampling.

Which sampling method involves dividing a population into groups and randomly selecting some members from each group for the sample?

Stratified random sampling

Gemini

The correct answer is Stratified random sampling. Here’s why:

Stratified Random Sampling:
- Population is divided into groups (strata) based on shared characteristics.
- Random selection is applied within each group.
- Ensures representation of important subgroups within the sample.

The other options don’t fit the description:

Systematic Random Sampling: Selects members at fixed intervals from an ordered list.
Cluster Random Sampling: Divides the population into clusters, then randomly selects entire clusters for the sample.
Simple Random Sampling: Each member of the population has an equal chance of being selected.

ChatGPT

The sampling method that involves dividing a population into groups and randomly selecting some members from each group for the sample is:

Cluster random sampling

In cluster random sampling, the population is divided into clusters or groups, and then a random sample of clusters is selected. All members within the selected clusters are included in the sample. This method is particularly useful when it’s impractical or costly to obtain a complete list of the population elements, as it allows for a more efficient sampling process by sampling groups rather than individual elements.

The instructor of a fitness class asks their regular students to take an online survey about the quality of the class. What sampling method does this scenario refer to?

Voluntary response sampling

Gemini

This scenario best aligns with Convenience Sampling. Here’s why:

Convenience sampling: Selects participants based on their accessibility and ease of contact. In this case, the instructor is surveying only their regular students, who are the most convenient group to sample.

Let’s analyze why the other options are less fitting:

Purposive sampling: Targets participants with specific characteristics relevant to the study. The instructor isn’t prioritizing certain traits among their regular students.
Voluntary response sampling: Individuals self-select to participate. While the survey is online, the instructor is actively asking their students to take it, not relying on volunteers.
Snowball sampling: Existing participants help recruit others. This doesn’t apply to the scenario.

Important Note: Convenience sampling often introduces biases, as the sample might not be representative of the broader population (e.g., all potential fitness class participants).

ChatGPT

The scenario where the instructor of a fitness class asks their regular students to take an online survey about the quality of the class refers to:

Convenience sampling

Convenience sampling involves selecting individuals who are easily accessible or readily available to participate in the study. In this scenario, the instructor selects their regular students, who are readily available and accessible to participate in the survey. Therefore, it aligns with the concept of convenience sampling.

Claude

The scenario you described refers to voluntary response sampling.

In voluntary response sampling, the researcher selects participants who volunteer to be part of the study. In this case, the instructor is asking the regular students to take an online survey, and those students who choose to participate are self-selecting into the sample.

Voluntary response sampling is a type of non-probability sampling, where the participants are not randomly selected. This can introduce bias into the sample, as those who choose to respond may differ in significant ways from those who do not.

The other options you provided are not accurate for this scenario:

Purposive sampling involves the researcher deliberately selecting participants based on specific criteria.
Convenience sampling involves selecting participants who are easily accessible to the researcher.
Snowball sampling involves the researcher asking initial participants to recruit additional participants from their social networks.

So in summary, the sampling method used in this fitness class survey scenario is voluntary response sampling.

Fill in the blank: Standard error measures the _____ of a sampling distribution.

standard deviation

Gemini

The correct answer is standard deviation.

Here’s why:

Sampling Distribution: A distribution of a sample statistic (like the mean) calculated from multiple samples drawn from the same population.
Standard Error: The standard deviation of this sampling distribution. It measures how much variability you’d expect in the sample statistic if you took multiple samples.

ChatGPT

Fill in the blank: Standard error measures the standard deviation of a sampling distribution.

What concept states that the sampling distribution of the mean approaches a normal distribution as the sample size increases?

Central limit theorem

Gemini

The correct answer is the Central Limit Theorem.

Here’s why:

Central Limit Theorem (CLT): A fundamental theorem in statistics that states that, regardless of the original distribution of the population, the sampling distribution of the mean will approach a normal distribution as the sample size (n) increases.
Importance: The CLT allows us to make inferences about the population mean even when we don’t know the population distribution.

The other options aren’t related to this concept:

Standard Error: Measures the variability of the sampling distribution of the mean.
Sampling Frame: The list from which a sample is drawn.
Bayes’ Theorem: Used for updating probabilities based on new evidence.

ChatGPT

The concept that states that the sampling distribution of the mean approaches a normal distribution as the sample size increases is:

Central limit theorem

A data professional is working with data about annual household income. They want to use Python to simulate taking a random sample of income values from the dataset. They write the following code: sample(n=100, replace=True, random_state=230). What is the sample size of the random sample?

100

Gemini

The correct answer is 100.

Here’s how we know:

sample(n=100, …): The n=100 parameter determines the sample size. This means the code will select 100 income values randomly.

Let me know if you’d like a breakdown of the other parameters in the code (replace and random_state)!

ChatGPT

The sample size of the random sample is indicated by the parameter n, which is specified as 100 in the code provided. Therefore, the sample size is 100.

Home » Google Career Certificates » Google Advanced Data Analytics Professional Certificate » The Power of Statistics » Module 3: Sampling

Introduction to sampling

Video: Welcome to module 3

Video: Cliff: Value everyone’s contributions

Video: Introduction to sampling

Reading: The relationship between sample and population

Population and Sample

Example: election poll

Key takeaways

Video: The sampling process

Reading: The stages of the sampling process

The sampling process

Key takeaways

Video: Compare sampling methods

Reading: Probability sampling methods

Video: The impact of bias in sampling

Reading: Non-probability sampling methods

Non-probability sampling methods

Key takeaways

Practice Quiz: Test your knowledge: Introduction to sampling

Sampling distributions

Video: How sampling affects your data

Video: The central limit theorem

Reading: Infer population parameters with the central limit theorem

The central limit theorem

Key takeaways

Video: The sampling distribution of the proportion

Reading: The sampling distribution of the mean

Sampling distribution of the sample mean

Key takeaways

Practice Quiz: Test your knowledge: Sampling distributions

Work with sampling distributions in Python

Lab: Annotated follow-along guide: Sampling distributions with Python

Data dictionary

Video: Sampling distributions with Python

Lab: Activity: Explore sampling

Lab: Exemplar: Explore sampling

Practice Quiz: Test your knowledge: Work with sampling distributions in Python

Review: Sampling

Video: Wrap-up

Reading: Glossary terms from module 3

Quiz: Module 3 challenge

Share this:

Like this: