
Module 1: The importance of integrity

As you start thinking about how to prepare your data for exploration, this part of the course will highlight why data integrity is so essential to successful decision-making. You’ll learn about how data is generated and the techniques analysts use to decide what data to collect for analysis. And you’ll discover structured and unstructured data, data types, and data formats.

Learning Objectives

  • Describe statistical measures associated with data integrity including statistical power, hypothesis testing, and margin of error
  • Describe strategies that can be used to address insufficient data
  • Discuss the importance of sample size with reference to sample bias and random samples
  • Describe the relationship between data and related business objectives
  • Define data integrity with reference to types and risks
  • Discuss the importance of pre-cleaning activities

Focus on integrity


Video: Introduction to focus on integrity

Key Points:

  • Clean data is crucial for accurate analysis and decision-making.
  • We’ll explore data integrity, sampling techniques, data testing, and cleaning methods.
  • Learn to clean data in spreadsheets, databases, and with SQL.
  • Verify and report your cleaning results for transparency and future reference.
  • Clean data skills are a valuable asset for any data analyst resume.

Hi! Good to see you! My name is Sally, and I’m here to
teach you all about processing data. I’m a measurement and
analytical lead at Google. My job is to help advertising agencies and
companies measure success and analyze their data, so I get to meet with lots of
different people to show them how data analysis helps
with their advertising. Speaking of analysis, you did great
earlier learning how to gather and organize data for analysis. It’s definitely an important step in
the data analysis process, so well done! Now let’s talk about how to make sure
that your organized data is complete and accurate. Clean data is the key to making sure your
data has integrity before you analyze it. We’ll show you how to make sure
your data is clean and tidy. Cleaning and processing data is one part
of the overall data analysis process. As a quick reminder,
that process is Ask, Prepare, Process, Analyze, Share, and Act. Which means it’s time for
us to explore the Process phase, and I’m here to guide you the whole way. I’m very familiar with
where you are right now. I’d never heard of data analytics until
I went through a program similar to this one. Once I started making progress, I realized
how much I enjoyed data analytics and the doors it could open. And now I’m excited to help
you open those same doors! One thing I realized as I worked for different companies, is that clean
data is important in every industry. For example, I learned early in my career
to be on the lookout for duplicate data, a common problem that analysts
come across when cleaning. I used to work for a company that had
different types of subscriptions. In our data set, each user would have
a new row for each subscription type they bought, which meant users would
show up more than once in my data. So if I had counted the number of users
in a table without accounting for duplicates like this, I would have
counted some users twice instead of once. As a result, my analysis would have been
wrong, which would have led to problems in my reports and for the stakeholders
relying on my analysis. Imagine if I told the CEO that we
had twice as many customers as we actually did!? That’s why clean data is so important. So the first step in processing data
is learning about data integrity. You will find out what
data integrity is and why it is important to maintain it
throughout the data analysis process. Sometimes you might not even
have the data that you need, so you’ll have to create it yourself. This will help you learn how sample size
and random sampling can save you time and effort. Testing data is another important
step to take when processing data. We’ll share some guidance on how to
test data before your analysis officially begins. Just like you’d clean your clothes and
your dishes in everyday life, analysts clean their data all the time,
too. The importance of clean data
will definitely be a focus here. You’ll learn data cleaning techniques for
all scenarios, along with some pitfalls to watch out for
as you clean. You’ll explore data cleaning in
both spreadsheets and databases, building on what you’ve already
learned about spreadsheets. We’ll talk more about SQL and
how you can use it to clean data and do other useful things, too. When analysts clean their data, they do
a lot more than a spot check to make sure it was done correctly. You’ll learn ways to verify and
report your cleaning results. This includes documenting your
cleaning process, which has lots of benefits that we’ll explore. It’s important to remember that processing
data is just one of the tasks you’ll complete as a data analyst. Actually, your skills with cleaning data
might just end up being something you highlight on your resume
when you start job hunting. Speaking of resumes, you’ll be able
to start thinking about how to build your own from the perspective
of a data analyst. Once you’re done here, you’ll have
a strong appreciation for clean data and how important it is in
the data analysis process. So let’s get started!
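
Sally's duplicate-subscription story is easy to reproduce. The sketch below is a minimal, hypothetical illustration (made-up user IDs, not course data) of how counting rows overstates the number of customers while counting distinct user IDs does not, the same idea behind SQL's COUNT(DISTINCT ...).

```python
# Hypothetical subscription records: one row per subscription purchased,
# so a user with several subscriptions appears more than once.
subscriptions = [
    {"user_id": 1, "plan": "basic"},
    {"user_id": 1, "plan": "premium"},  # same user, second subscription
    {"user_id": 2, "plan": "basic"},
    {"user_id": 3, "plan": "premium"},
]

row_count = len(subscriptions)                                  # counts user 1 twice
unique_users = len({row["user_id"] for row in subscriptions})   # counts each user once

print(f"Rows: {row_count}, unique users: {unique_users}")       # Rows: 4, unique users: 3
```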

Data integrity and analytics objectives


Video: Why data integrity is important

Key Points:

  • Clean data, strong analysis: Data integrity (accuracy, completeness, consistency, trustworthiness) is crucial for reliable results.
  • Risks to integrity: Replication, transfer, manipulation, human error, viruses, and system failures can all compromise data.
  • Good news: In many companies, the data warehouse or data engineering team handles data integrity.
  • Analyst’s role: Double-check data completeness and validity before analysis for accurate conclusions.

Remember: Checking data integrity is a vital step to avoid basing your analysis on shaky ground. Stay tuned for deeper insights!

Data is the lifeblood of modern decision-making. From businesses to healthcare, every field relies on accurate and reliable data to make informed choices. But what happens when the data itself is flawed? That’s where data integrity comes in.

What is Data Integrity?

Data integrity refers to the accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle. In simpler terms, it’s about ensuring your data is reliable and reflects reality. Think of it like building a house: if the foundation (data) is shaky, the whole structure (analysis and decisions) will be unstable.

Why is Data Integrity Important?

The consequences of compromised data can be far-reaching:

  • Misleading decisions: Imagine basing a marketing campaign on inaccurate customer data. You might target the wrong audience, wasting resources and damaging your brand.
  • Financial losses: Incorrect inventory data could lead to overstocking or understocking, resulting in financial losses.
  • Ethical concerns: In healthcare, faulty data could lead to misdiagnoses or improper treatment, impacting patient well-being.

Common Data Integrity Threats:

Several factors can compromise data integrity:

  • Human error: Typos, data entry mistakes, and incorrect calculations can all introduce errors.
  • Technical issues: Software bugs, hardware failures, and network problems can corrupt data.
  • Security breaches: Cyberattacks and unauthorized access can manipulate or steal data.
  • Data manipulation: Intentionally altering data for personal gain or to skew results can have severe consequences.

Maintaining Data Integrity:

Fortunately, there are ways to safeguard data integrity:

  • Data validation: Implement checks to ensure data accuracy and completeness at every stage.
  • Data cleaning: Identify and remove errors or inconsistencies in existing data.
  • Data backup and recovery: Regularly back up data to minimize the impact of data loss.
  • Security measures: Implement robust security protocols to protect data from unauthorized access and manipulation.
  • Documentation: Document data collection, storage, and processing procedures to ensure transparency and auditability.
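
As a rough illustration of the validation and cleaning items in the list above, here is a minimal sketch in Python. The records and rules are hypothetical; real checks would reflect your own schema and business rules.

```python
from datetime import datetime

# Hypothetical order records to validate before analysis.
orders = [
    {"order_id": 101, "amount": 25.00, "date": "2024-01-15"},
    {"order_id": 102, "amount": -5.00, "date": "2024-01-16"},  # invalid amount
    {"order_id": 101, "amount": 25.00, "date": "2024-01-15"},  # duplicate ID
    {"order_id": 103, "amount": 12.50, "date": None},          # missing date
]

issues = []
seen_ids = set()
for row in orders:
    if row["order_id"] in seen_ids:
        issues.append(f"duplicate order_id {row['order_id']}")
    seen_ids.add(row["order_id"])
    if row["amount"] is None or row["amount"] < 0:
        issues.append(f"invalid amount in order {row['order_id']}")
    if row["date"] is None:
        issues.append(f"missing date in order {row['order_id']}")
    else:
        datetime.strptime(row["date"], "%Y-%m-%d")  # raises ValueError if malformed

print(issues)  # lists the three problems flagged above
```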

Investing in Data Integrity:

While maintaining data integrity requires effort, the benefits are substantial:

  • Improved decision-making: Reliable data leads to better-informed choices, enhancing business processes and outcomes.
  • Enhanced efficiency: Clean and consistent data streamlines analysis, saving time and resources.
  • Increased trust: By ensuring data reliability, you build trust with stakeholders and customers.

Conclusion:

Data integrity is not an option, it’s a necessity in today’s data-driven world. By proactively safeguarding data integrity, you can ensure your analysis is based on solid ground, leading to trustworthy insights and informed decisions. Remember, your data is only as good as its integrity – treat it with the care it deserves!

Bonus: You can explore further resources like specific data validation techniques, data cleaning tools, and best practices for secure data storage to learn more about practical implementation.

This tutorial provides a basic framework for understanding data integrity and its importance. Feel free to customize and expand it with more specific examples and detailed techniques relevant to your field of interest.

Welcome back. In this video, we’re going to discuss
data integrity and some risks you might run
into as a data analyst. A strong analysis depends on
the integrity of the data. If the data you’re using
is compromised in any way, your analysis won’t be as
strong as it should be. Data integrity is the
accuracy, completeness, consistency, and
trustworthiness of data throughout its lifecycle. That might sound like a lot of qualities for the
data to live up to. But trust me, it’s worth
it to check for them all before proceeding
with your analysis. Otherwise, your analysis
could be wrong. Not because you did
something wrong, but because the data
you were working with was wrong to begin with. When data integrity is low, it can cause anything from
the loss of a single pixel in an image to an incorrect
medical decision. In some cases, one missing piece can make
all of your data useless. Data integrity can be compromised in lots
of different ways. There’s a chance data can be compromised every
time it’s replicated, transferred, or
manipulated in any way. Data replication
is the process of storing data in
multiple locations. If you’re replicating data at different times in
different places, there’s a chance your
data will be out of sync. This data lacks integrity because different
people might not be using the same data for their findings, which can
cause inconsistencies. There’s also the issue
of data transfer, which is the process of copying data from a storage device to memory, or from one
computer to another. If your data transfer
is interrupted, you might end up with
an incomplete data set, which might not be
useful for your needs. The data manipulation
process involves changing the data to make it more
organized and easier to read. Data manipulation
is meant to make the data analysis
process more efficient, but an error during the process can compromise the efficiency. Finally, data can
also be compromised through human error, viruses, malware, hacking,
and system failures, which can all lead to
even more headaches. I’ll stop there. That’s enough potentially
bad news to digest. Let’s move on to some
potentially good news. In a lot of companies,
the data warehouse or data engineering team takes care of ensuring data integrity. Coming up, we’ll
learn about checking data integrity as a data analyst. But rest assured, someone else will usually
have your back too. After you’ve found out what
data you’re working with, it’s important to double-check
that your data is complete and valid
before analysis. This will help ensure
that your analysis and eventual conclusions
are accurate. Checking data integrity
is a vital step in processing your data to
get it ready for analysis, whether you or someone else
at your company is doing it. Coming up, you’ll learn even more about data integrity.
See you soon!

Reading: More about data integrity and compliance


Video: Balancing objectives with data integrity

Key Points:

  • Data integrity is crucial, but aligning data with business objectives adds another layer of complexity.
  • Matching data tables to specific questions is straightforward, but limitations like uncleanliness, duplicates, and lack of data can affect analysis.
  • Incomplete data is like an unclear picture – you won’t see the whole story. Relying solely on data format can be misleading.
  • Real-life example: An analyst navigates limited delivery data by collaborating with engineers to improve tracking, resulting in faster deliveries and happier customers.
  • Learning to handle data limitations while aiming for objectives is key to a successful data analyst career.

Next: Deeper dive into aligning data with objectives.

Bonus: The “picture” analogy vividly portrays the importance of complete data for accurate analysis.

Welcome, data enthusiasts! Today, we tackle a crucial tightrope walk: balancing your business objectives with the unwavering need for data integrity. Like a skilled acrobat, we’ll navigate the delicate equilibrium between getting answers and maintaining trustworthy results.

Why the Balancing Act?

Imagine this: You’re tasked with boosting e-commerce sales. You dive into analytics, but the data’s riddled with missing values and duplicate entries. What do you do? Rush an analysis based on shaky ground, potentially misleading the company, or delay insights while meticulously cleaning the data, risking missed sales opportunities?

The Balancing Equation:

Here’s the key: achieving objectives without compromising data integrity. It’s a constant negotiation, and mastering it sets you apart as a top-notch data analyst.

Strategies for Success:

  • Clarity of Objectives: Define your goals precisely. Are you aiming for long-term trends or short-term insights? This guides your data selection and analysis methods.
  • Know Your Data: Assess your data’s strengths and weaknesses. What limitations exist? Incomplete records? Biases? Understanding these limitations informs your analysis and interpretation.
  • Transparency is Key: Communicate data limitations upfront. Don’t shy away from caveats and uncertainties. Clear communication builds trust and avoids misinterpretations.
  • Embrace Iterations: Analyze, assess, adapt. Don’t see data cleaning as a one-time chore. Be prepared to refine your data and analysis as needed to reach trustworthy conclusions.
  • Alternative Data Sources: Sometimes, your existing data might not be the answer. Explore alternative sources that, when combined with your current data, can fill in gaps and paint a clearer picture.
  • Collaborate with Data Warriors: Seek input from data engineers and specialists. Their expertise can help optimize data cleaning and access specialized tools.

Real-World Example:

Remember the e-commerce scenario? Instead of rushing an analysis with dirty data, you collaborate with the data team to clean and deduplicate records. You discover a segment of loyal customers responsible for a significant portion of sales. Armed with this insight, you propose targeted marketing campaigns, boosting sales without compromising data integrity.

Remember: Balancing objectives with data integrity is a continuous dance. But with the right tools and mindset, you can become a data acrobat, weaving insights from even the messiest data sets, while keeping your footing on the solid ground of data integrity.

Bonus Tips:

  • Document your data limitations and cleaning procedures for future reference and transparency.
  • Stay updated on data quality tools and techniques to continuously improve your data skills.
  • Practice communicating data limitations and uncertainties effectively to stakeholders.

So, step onto the tightrope, fellow data analysts! With confidence, collaboration, and a relentless pursuit of trustworthy insights, you can master the art of balancing objectives with data integrity.

This tutorial provides a framework for understanding the balancing act between achieving objectives and maintaining data integrity. Feel free to customize it with specific examples from your field and incorporate more detailed recommendations on data cleaning techniques and alternative data sources. Let’s all strive to be data acrobats, weaving impactful insights from the intricate dances of objectives and data integrity!

Hey there, it’s good to remember
to check for data integrity. It’s also important to check that
the data you use aligns with the business objective. This adds another layer to
the maintenance of data integrity because the data you’re using might have
limitations that you’ll need to deal with. The process of matching data to business
objectives can actually be pretty straightforward. Here’s a quick example.
Let’s say you’re an analyst for a business that produces and
sells auto parts. If you need to address a question about
the revenue generated by the sale of a certain part, then you’d pull up
the revenue table from the data set. If the question is about customer reviews, then you’d pull up the reviews table
to analyze the average ratings. But before digging into any analysis, you need to consider a few
limitations that might affect it. If the data hasn’t been cleaned properly,
then you won’t be able to use it yet. You would need to wait until
a thorough cleaning has been done. Now, let’s say you’re trying to find
how much an average customer spends. You notice the same customer’s data
showing up in more than one row. This is called duplicate data. To fix this, you might need to
change the format of the data, or you might need to change the way
you calculate the average. Otherwise, it will seem like the data
is for two different people, and you’ll be stuck with
misleading calculations. You might also realize there’s not enough
data to complete an accurate analysis. Maybe you only have a couple
of months’ worth of sales data. There’s slim chance you could
wait for more data, but it’s more likely that you’ll
have to change your process or find alternate sources of data
while still meeting your objective. I like to think of
a data set like a picture. Take this picture. What are we looking at? Unless you’re an expert traveler
or know the area, it may be hard to pick out
from just these two images. Visually, it’s very clear when we
aren’t seeing the whole picture. When you get the complete picture,
you realize… you’re in London! With incomplete data, it’s hard to see the whole picture to
get a real sense of what is going on. We sometimes trust data because if it
comes to us in rows and columns, it seems like everything we need is there if we
just query it. But that’s just not true. I remember a time when I found
out I didn’t have enough data and had to find a solution. I was working for
an online retail company and was asked to figure out how to shorten
customer purchase to delivery time. Faster delivery times usually
lead to happier customers. When I checked the data set,
I found very limited tracking information. We were missing some pretty key details. So the data engineers and I created new
processes to track additional information, like the number of stops in a journey. Using this data, we reduced the time
it took from purchase to delivery and saw an improvement in customer
satisfaction. That felt pretty great! Learning how to deal with data issues
while staying focused on your objective will help set you up for success in
your career as a data analyst. And your path to success continues. Next step, you’ll learn more about
aligning data to objectives. Keep it up!
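
The duplicate-customer pitfall Sally describes, where you may need to change how the average is calculated, can be sketched in a few lines. The purchase data below is hypothetical; the point is that aggregating per customer before averaging avoids treating one customer as two.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase rows: the same customer can appear in more than one row.
purchases = [
    {"customer_id": "A", "amount": 30.0},
    {"customer_id": "A", "amount": 20.0},
    {"customer_id": "B", "amount": 40.0},
]

# Naive average treats every row as a separate customer.
naive_avg = mean(p["amount"] for p in purchases)  # 30.0

# Aggregate per customer first, then average the customer totals.
totals = defaultdict(float)
for p in purchases:
    totals[p["customer_id"]] += p["amount"]
avg_per_customer = mean(totals.values())          # (50 + 40) / 2 = 45.0

print(naive_avg, avg_per_customer)
```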

Reading: Well-aligned objectives and data

Practice Quiz: Test your knowledge on data integrity and analytics objectives

Which of the following principles are key elements of data integrity? Select all that apply.

Which process do data analysts use to make data more organized and easier to read?

Before analysis, a company collects data from countries that use different date formats. Which of the following updates would improve the data integrity?

Overcoming the challenges of insufficient data


Video: Dealing with insufficient data

Key Points:

  • Insufficient data can hinder reliable analysis, even with the vast amount available.
  • Setting limits: Define the scope and required data for your analysis beforehand.
  • Real-world example: Forecasting support tickets needed a multi-year data window to account for seasonality.
  • Common limitations and solutions:
    • Limited source: Combine data from multiple sources for broader insights.
    • Updating data: Wait for data to settle or adjust analysis objective (e.g., weekly trends instead of monthly).
    • Outdated data: Find a newer, relevant data set.
    • Geographically-limited data: Use global data for global analyses.
  • Strategies: Identify trends within existing data, wait for more data, consult stakeholders, or find new data sources.
  • Mastering these skills sets you up for success as a data analyst.

Bonus: Remember, the specific approach depends on your role and industry needs.

Next: Learn about statistical power, another valuable tool for data analysis.

This summary captures the key takeaways from the video, highlighting challenges and solutions for handling insufficient data in various scenarios.

Data Detective: Handling Insufficient Data Like a Pro

Welcome to the world of data analysis, where even an abundance of information doesn’t always guarantee a smooth ride. Sometimes, you’ll face the challenge of insufficient data. But worry not, fellow analyst! I’m here to guide you through the maze of missing data and help you emerge with reliable insights.

Here’s your toolkit for navigating this dilemma:

1. Scope and Limits: Your Analysis Blueprint

  • Define your business objective clearly: What questions are you aiming to answer?
  • Determine the essential data elements: What information is crucial for drawing meaningful conclusions?
  • Set boundaries for your analysis: What timeframe and geographical scope align with your objective?

2. Common Data Limitations and Solutions:

  • Limited Source:
    • Seek additional sources to broaden your data pool.
    • Combine data from different platforms or providers, ensuring compatibility and quality.
  • Updating Data:
    • If possible, wait for more data to accumulate before conducting analysis.
    • Adjust your analysis objective to accommodate the available data (e.g., analyze weekly trends instead of monthly).
  • Outdated Data:
    • Find a newer, more relevant dataset that reflects current trends.
  • Geographically-Limited Data:
    • Expand your scope to include a wider geographical range if applicable.
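
As a sketch of the "Limited Source" remedy above (combining data from more than one platform), here is a minimal Python example with two hypothetical booking-site extracts whose field names differ and need normalizing before the counts can be merged.

```python
# Hypothetical search logs from two booking sites with slightly different schemas.
site_a = [{"destination": "Paris", "searches": 120},
          {"destination": "Tokyo", "searches": 95}]
site_b = [{"dest": "Paris", "count": 80},
          {"dest": "Lisbon", "count": 60}]

# Normalize the field names, then combine so trends reflect more than one source.
combined = {}
for row in site_a:
    combined[row["destination"]] = combined.get(row["destination"], 0) + row["searches"]
for row in site_b:
    combined[row["dest"]] = combined.get(row["dest"], 0) + row["count"]

print(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))
# [('Paris', 200), ('Tokyo', 95), ('Lisbon', 60)]
```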

3. Strategies for Action:

  • Identify Trends with Available Data:
    • Look for patterns and tendencies within the limited dataset, acknowledging any potential biases.
  • Wait for More Data (if time allows):
    • If feasible, postpone analysis until sufficient data is available.
  • Talk with Stakeholders and Adjust Objective:
    • Collaborate with stakeholders to explore alternative objectives that can be addressed with the existing data.
  • Look for a New Dataset:
    • Seek external or internal sources for a dataset that meets your analysis requirements.

4. Transparency and Communication:

  • Acknowledge limitations upfront: Be transparent with stakeholders about any constraints in the data.
  • Express confidence intervals: State the range within which you’re confident about your results.
  • Highlight assumptions and potential biases: Indicate any assumptions made during analysis and potential biases in the data.

Remember:

  • Data analysis is an iterative process: Be prepared to adapt your approach as you encounter challenges.
  • Consult with experts: Seek guidance from experienced analysts or statisticians when needed.
  • Embrace uncertainty: Accept that some level of uncertainty is inherent in data analysis, and strive for the most reliable insights possible.

Practice with these tips, and you’ll become a master of dealing with insufficient data, ensuring the integrity and value of your analysis. Stay curious, stay adaptable, and keep those data detective skills sharp!

If you have to complete your analysis with insufficient data, how should you address this limitation?

Identify trends with the available data

If you have insufficient data, you can identify trends with the data that is available and qualify your findings accordingly.

Every analyst has been in a situation
where there is insufficient data to help with their business objective. Considering how much data is generated
every day, it may be hard to believe, but it’s true. So let’s discuss what you can do
when you have insufficient data. We’ll cover how to set limits for
the scope of your analysis and what data you should include. At one point,
I was a data analyst at a support center. Every day, we received customer questions,
which were logged in as support tickets. I was asked to forecast the number of
support tickets coming in per month to figure out how many additional
people we needed to hire. It was very important that we had
sufficient data spanning back at least a couple of years because I had to account
for year-to-year and seasonal changes. If I just had the current
year’s data available, I wouldn’t have known that
a spike in January is common and has to do with people asking for
refunds after the holidays. Because I had sufficient data, I was able to suggest we hire more
people in January to prepare. Challenges are bound to come up, but the good news is that once
you know your business objective, you’ll be able to recognize
whether you have enough data. And if you don’t, you’ll be able to deal
with it before you start your analysis. Now, let’s check out some of those
limitations you might come across and how you can handle different
types of insufficient data. Say you’re working in
the tourism industry, and you need to find out which travel
plans are searched most often. If you only use data
from one booking site, you’re limiting yourself to
data from just one source. Other booking sites might show different
trends that you would want to consider for your analysis. If a limitation like this impacts
your analysis, you can stop and go back to your stakeholders
to figure out a plan. If your data set keeps updating,
that means the data is still incoming and might not be complete. So if there’s a brand new tourist
attraction that you’re analyzing interest and attendance for, there’s probably not
enough data for you to determine trends. For example, you might want to
wait a month to gather data. Or you can check in with the stakeholders
and ask about adjusting the objective. For example, you might analyze trends from
week to week instead of month to month. You could also base your analysis on
trends over the past three months and say, “Here’s what attendance at the attraction
for month four could look like.” You might not have enough data to know
if this number is too low or too high. But you would tell stakeholders that it’s
your best estimate based on the data that you currently have. On the other hand, your data could
be older and no longer be relevant. Outdated data about customer satisfaction
won’t include the most recent responses. So you’ll be relying on the ratings for
hotels or vacation rentals that might
no longer be accurate. In this case, your best bet might be
to find a new data set to work with. Data that’s geographically-limited
could also be unreliable. If your company is global, you wouldn’t
want to use data limited to travel in just one country. You would want
a data set that includes all countries. So that’s just a few of the most common
limitations you’ll come across and some ways you can address them. You can identify trends with the available
data or wait for more data if time allows; you can talk with stakeholders and
adjust your objective; or you can look for a new data set. The need to take these steps will
depend on your role in your company and possibly the needs of the wider industry. But learning how to deal with insufficient
data is always a great way to set yourself up for success. Your data analyst powers are growing
stronger. And just in time. After you learn more about limitations and
solutions, you’ll learn about statistical power,
another fantastic tool for you to use. See you soon!

Reading: What to do when you find an issue with your data


Video: The importance of sample size

Key Points:

  • Samples vs. Populations: Analyzing populations (all data values) is ideal, but often impractical.
  • Sample Size: Using a representative portion of the population to draw conclusions about the whole.
  • Benefits: Faster, cheaper, still insightful.
  • Trade-off: Uncertainty – conclusions may not perfectly reflect the population.
  • Sampling Bias: Unequal representation of population groups can skew results.
  • Random Sampling: Selecting samples with equal chance for each member minimizes bias.
  • Sample Size Planning: Done before data analysis to ensure representativeness.

Remember: Samples offer powerful insights while saving time and resources. Choosing them wisely through random sampling helps mitigate uncertainty and maintain data integrity.

Next: Dive deeper into sample size calculations and applications!

Imagine you’re on a quest to understand the preferences of dog owners in Los Angeles. You could try interviewing every single dog owner – a monumental task! Luckily, data analysis offers a secret weapon: sample size.

Why Sample Size Matters:

Think of the entire population of dog owners in LA as a vast land you want to explore. Analyzing every person (your population) would be ideal, but who has the time or resources? This is where the sample size comes in – your trusty mini-me of the population. It’s a smaller, manageable group chosen to represent the larger whole. By analyzing this mini-me, you can draw insights about the entire population.

Benefits of Sample Size:

  • Efficiency: Saves time and resources compared to analyzing the entire population.
  • Feasibility: Makes large-scale studies possible.
  • Cost-effectiveness: Saves money compared to comprehensive data collection.

The Trade-off: Uncertainty:

While convenient, samples come with a slight hitch: uncertainty. Your mini-me might not perfectly reflect the larger population. Imagine drawing a random sample of just five dog owners – their favorite kibble may not represent the preferences of all LA dog owners.

Minimizing Uncertainty:

Here’s how we can make our trusty mini-me a more accurate reflection:

  • Sample Size Calculation: Using statistical formulas, we can determine the minimum sample size needed for reliable results. Think of it as finding the perfect miniature size for your landscape model.
  • Random Sampling: Choosing your mini-me members randomly ensures everyone has an equal chance of being included, reducing bias and making your sample more representative. Imagine blindly picking names from a hat – everyone gets a fair shot!
  • Stratification: Dividing the population into subgroups (e.g., dog breeds) and then randomly sampling from each subgroup ensures all groups are represented proportionally. It’s like creating miniature versions of each neighborhood in your landscape model.
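
To make the Random Sampling and Stratification items above concrete, here is a small Python sketch with a made-up population of 500 dog owners. The size groups and the sample size of 50 are illustrative assumptions, not values from the course.

```python
import random

random.seed(42)  # so the example is reproducible

# Hypothetical population: 500 dog owners, each tagged with a made-up size group.
population = [{"owner_id": i, "breed": random.choice(["small", "medium", "large"])}
              for i in range(500)]

# Simple random sample: every owner has an equal chance of being chosen.
simple_sample = random.sample(population, 50)

# Stratified sample: sample proportionally from each group so all are represented.
stratified_sample = []
for breed in ("small", "medium", "large"):
    group = [p for p in population if p["breed"] == breed]
    k = round(len(group) * 50 / len(population))  # proportional share of 50
    stratified_sample.extend(random.sample(group, k))

# Rounding can shift the stratified total by one or two.
print(len(simple_sample), len(stratified_sample))
```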

Mastering Sample Size:

By understanding the trade-off between efficiency and uncertainty, and by employing best practices like sample size calculation and random sampling, you can harness the power of samples to draw valuable insights from even the mightiest populations.

Remember: Sample size is a crucial tool in any data analyst’s arsenal. Use it wisely, and you’ll unlock the secrets of extracting accurate and actionable insights from the vast world of data.

Next Steps:

  • Dive deeper into sample size calculations and explore different sampling techniques.
  • Practice applying sample size concepts to real-world data analysis scenarios.
  • Become a sample size master and conquer the uncertainty monster!

This tutorial provides a high-level overview of the importance of sample size in data analysis. It highlights the benefits, trade-offs, and strategies for minimizing uncertainty, encouraging viewers to delve deeper into the topic and hone their sampling skills.

What are some of the possible challenges associated with using 100% of a population in data analysis? Select all that apply.

Using 100% of a population is expensive. Using 100% of a population is time-consuming.

Using 100% of a population is time-consuming and expensive.

Okay, earlier we
talked about having the right kind of data to
meet your business objective and the importance of having
the right amount of data to make sure your analysis is
as accurate as possible. You might remember that
for data analysts, a population is all
possible data values in a certain dataset. If you’re able to
use 100 percent of a population in your
analysis, that’s great. But sometimes collecting
information about an entire population
just isn’t possible. It’s too time-consuming
or expensive. For example, let’s say
a global organization wants to know more about
pet owners who have cats. You’re tasked with finding
out which kinds of toys cat owners in Canada prefer. But there are millions of
cat owners in Canada, so getting data from all of them would be a huge challenge. Fear not! Allow me to
introduce you to… sample size! When you use sample
size or a sample, you use a part of a population that’s representative
of the population. The goal is to get
enough information from a small group within a population to make predictions or conclusions about
the whole population. The sample size helps ensure the degree
to which you can be confident that your conclusions accurately represent
the population. For the data on cat owners, a sample size might
contain data about hundreds or thousands of
people rather than millions. Using a sample for
analysis is more cost-effective and
takes less time. If done carefully
and thoughtfully, you can get the
same results using a sample size instead
of trying to hunt down every single cat owner to find out their
favorite cat toys. There is a potential
downside, though. When you only use a small
sample of a population, it can lead to uncertainty. You can’t really be 100 percent sure that your statistics are a complete and accurate
representation of the population. This leads to sampling bias, which we covered
earlier in the program. Sampling bias is when a sample isn’t representative of
the population as a whole. This means some members
of the population are being overrepresented
or underrepresented. For example, if the survey
used to collect data from cat owners only included
people with smartphones, then cat owners who don’t have a smartphone wouldn’t be
represented in the data. Using random sampling can help address some of those
issues with sampling bias. Random sampling is a way of selecting a sample
from a population so that every possible type of the sample has an equal
chance of being chosen. Going back to our
cat owners again, using a random sample of
cat owners means cat owners of every type have an equal
chance of being chosen. Cat owners who live in
apartments in Ontario would have the same chance of
being represented as those who live in
houses in Alberta. As a data analyst, you’ll find that creating
sample sizes usually takes place before you
even get to the data. But it’s still good
for you to know that the data you are
going to analyze is representative of the population and works with your objective. It’s also good to know what’s coming up in your data journey. In the next video, you’ll
have an option to become even more comfortable with sample
sizes. See you there.

Reading: Calculating sample size


Practice Quiz: Test your knowledge on insufficient data

What should an analyst do if they do not have the data needed to meet a business objective? Select all that apply.

Which of the following are limitations that might lead to insufficient data? Select all that apply.

A data analyst wants to find out how many people in Utah have swimming pools. It’s unlikely that they can survey every Utah resident. Instead, they survey enough people to be representative of the population. This describes what data analytics concept?

Testing your data


Video: Using statistical power

Main points:

  • Introduction: Statistical power as a “data superpower” for getting meaningful results from tests.
  • Example: Testing a milkshake ad on a sample of customers to gauge its effectiveness.
  • Larger sample size: Increases the chance of statistically significant results.
  • Statistical power: Probability of achieving significant results, typically shown as a value out of 1 (e.g., 0.6 = 60%).
  • Statistical significance: Means results are real and not due to random chance.
  • Target power: 80% or higher for reliable conclusions.
  • Second example: Testing a new milkshake flavor in limited locations, considering factors impacting power.
  • Measurable effects: Increased sales or customer numbers in the test locations.
  • Conclusion: Statistical power is a crucial tool for data analysts, even if it lacks the flashy appeal of flying.

Key takeaways:

  • Understand the concept of statistical power and its importance in testing.
  • Recognize the connection between sample size and power.
  • Interpret statistical power values and their implications for results.
  • Be aware of factors influencing power and consider them when designing tests.

This summary captures the key ideas of the video, providing a concise overview of this essential data analysis concept. It also maintains a lighthearted tone with the milkshake examples, making the topic more engaging and relatable.

Introduction:

  • What is statistical power? The probability of detecting a true effect or difference in a study, if it actually exists.
  • Why is it important? Helps ensure that your results are reliable and not due to chance, aids in designing studies with adequate sample sizes.

Key Concepts:

  • Significance level (alpha): The threshold for deciding whether a result is statistically significant (typically 0.05).
  • Effect size: The magnitude of the difference or relationship you’re studying (small, medium, or large).
  • Sample size: The number of participants or observations in your study.
  • Power calculations: Used to determine the appropriate sample size to achieve a desired level of power (typically 80% or higher).

Steps in Using Statistical Power:

  1. Define your research question and hypothesis.
  2. Determine the appropriate significance level (alpha).
  3. Estimate the expected effect size (based on previous research or pilot studies).
  4. Conduct a power analysis to calculate the required sample size.
  5. Collect data and conduct your analysis.
  6. Interpret your results in the context of statistical power.

Factors Affecting Statistical Power:

  • Sample size: Increasing sample size increases power.
  • Effect size: Larger effect sizes are easier to detect, requiring smaller sample sizes.
  • Significance level: More stringent alpha levels (e.g., 0.01 instead of 0.05) decrease power.
  • Variability in the data: Higher variability reduces power.
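
One way to see how these factors interact is to estimate power by simulation. The sketch below assumes a hypothetical A/B test comparing two conversion rates with a two-proportion z-test; the rates (10% versus 15%) and the sample sizes are illustrative, not from the course.

```python
import math
import random

random.seed(0)

def power_two_proportions(p_control, p_treatment, n_per_group, alpha=0.05, sims=2000):
    """Estimate power by simulation: the fraction of simulated experiments in which
    a two-sided, two-proportion z-test rejects the null hypothesis at level alpha."""
    rejections = 0
    for _ in range(sims):
        x1 = sum(random.random() < p_control for _ in range(n_per_group))
        x2 = sum(random.random() < p_treatment for _ in range(n_per_group))
        pooled = (x1 + x2) / (2 * n_per_group)
        se = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
        if se == 0:
            continue
        z = (x2 / n_per_group - x1 / n_per_group) / se
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p_value < alpha:
            rejections += 1
    return rejections / sims

# Larger samples detect the same effect more reliably, i.e. higher power.
for n in (100, 400, 900):
    print(n, round(power_two_proportions(0.10, 0.15, n), 2))
```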

Tools for Power Analysis:

  • Statistical software packages: SPSS, SAS, R, G*Power.
  • Online power calculators: Available from various sources.

Best Practices:

  • Plan for power: Incorporate power analysis into study design early.
  • Aim for high power: Target 80% or higher to minimize false negatives.
  • Consider practical constraints: Balance power with resources and feasibility.
  • Report power calculations: Include in research publications for transparency.

Additional Considerations:

  • Ethical implications: Avoid unnecessarily large sample sizes that place burdens on participants.
  • Alternative approaches: Consider Bayesian methods that incorporate prior information and aren’t as reliant on sample size.

Remember: Statistical power is an essential tool for robust and reliable data analysis. By understanding and applying it correctly, you can enhance the validity and impact of your research findings.

Hey, there. We’ve all probably dreamed of having
a superpower at least once in our lives. I know I have. I’d love to be able to fly. But there’s another superpower you might
not have heard of: statistical power. Statistical power is the probability of
getting meaningful results from a test. I’m guessing that’s not a superpower
any of you have dreamed about. Still, it’s a pretty
great data superpower. For data analysts, your projects
might begin with the test or study. Hypothesis testing is a way
to see if a survey or experiment has meaningful results. Here’s an example. Let’s say you work for a restaurant chain
that’s planning a marketing campaign for their new milkshakes. You need to test the ad on a group
of customers before turning it into a nationwide ad campaign. In the test, you want to check whether
customers like or dislike the campaign. You also want to rule out any factors
outside of the ad that might lead them to say they don’t like it. Using all your customers would be
too time consuming and expensive. So, you’ll need to figure out how many
customers you’ll need to show that the ad is effective. Fifty probably wouldn’t be enough. Even if you randomly chose 50 customers, you might end up with customers
who don’t like milkshakes at all. And if that happens, you won’t be able to
measure the effectiveness of your ad in getting more milkshake orders since no
one in the sample size would order them. That’s why you need a larger sample size: so you can make sure you get a good number
of all types of people for your test. Usually, the larger the sample size,
the greater the chance you’ll have statistically significant
results with your test. And that’s statistical power. In this case, using as many customers as
possible will show the actual differences between the groups who like or dislike the ad versus people whose
decision wasn’t based on the ad at all. There are ways to accurately
calculate statistical power, but we won’t go into them here. You might need to calculate it
on your own as a data analyst. For now, you should know that statistical
power is usually shown as a value out of one. So if your statistical power is 0.6,
that’s the same thing as saying 60%. In the milkshake ad test, if you found a statistical power of 60%,
that means there’s a 60% chance of you getting a statistically significant
result on the ad’s effectiveness. “Statistically significant”
is a term that is used in statistics. If you want to learn more about the
technical meaning, you can search online. But in basic terms,
if a test is statistically significant, it means the results of the test are real
and not an error caused by random chance. So there’s a 60% chance that
the results of the milkshake ad test are reliable and real and a 40% chance that the result
of the test is wrong. Usually, you need a statistical
power of at least 0.8 or 80% to consider your results
statistically significant. Let’s check out one more scenario. We’ll stick with milkshakes because,
well, because I like milkshakes. Imagine you work for a restaurant chain
that wants to launch a brand-new birthday cake flavored milkshake. This milkshake will be more expensive
to produce than your other milkshakes. Your company hopes that the buzz around
the new flavor will bring in more customers and money to offset this cost. They want to test this out in
a few restaurant locations first. So let’s figure out how many locations
you’d have to use to be confident in your results. First, you’d have to think about
what might prevent you from getting statistically significant results. Are there restaurants running any
other promotions that might bring in new customers? Do some restaurants have customers
that always buy the newest item, no matter what it is? Do some location have construction
that recently started, that would prevent customers from
even going to the restaurant? To get a higher statistical power, you’d
have to consider all of these factors before you decide how many locations
to include in your sample size for your study. You want to make sure any effect is most
likely due to the new milkshake flavor, not another factor. The measurable effects would
be an increase in sales or the number of customers at the
locations in your sample size. That’s it for now. Coming up, we’ll explore sample
sizes in more detail, so you can get a better idea of how
they impact your tests and studies. In the meantime, you’ve gotten to know
a little bit more about milkshakes and superpowers. And of course, statistical power. Sadly, only statistical power can
truly be useful for data analysts. Though putting on my cape and flying to grab a milkshake right
now does sound pretty good.

Reading: What to do when there is no data


Video: Determine the best sample size

Main points:

  • Sample size: A representative subset of a larger population used for studies.
  • Benefits: Reduces cost and time compared to analyzing the entire population.
  • Confidence level: Probability that the sample accurately reflects the larger population (99% is ideal; most industries aim for at least 90-95%).
  • Margin of error: How close the sample results are likely to be to the population results (smaller is better).
  • Sample size calculator: Online tools help determine appropriate sample size based on population, confidence level, and margin of error.
  • Example: Calculating sample size for a student candy preference study with 500 students, 95% confidence level, and 5% margin of error – result: 218 students.
  • Key takeaway: Understanding sample size, confidence level, and margin of error is crucial for accurate data analysis.

Additional notes:

  • Emphasizes the importance of data integrity for valid results.
  • References sample size as a “data superpower” similar to the previous video on statistical power.
  • Encourages viewers to practice with online calculators for practical application.

This summary captures the key information from the video, providing a concise overview of sample size and its connection to confidence level and margin of error. The example and practical tips make the concepts relatable and easily applicable.

Introduction:

  • What is sample size? The number of participants or observations included in a study.
  • Importance:
    • Crucial for accurate and reliable results.
    • Too small: May miss important effects or relationships.
    • Too large: Wastes resources and time.

Key Factors to Consider:

  1. Confidence Level: The desired probability that your sample results accurately reflect the population (e.g., 95%, 99%).
  2. Margin of Error: The acceptable amount of difference between your sample results and the true population values (e.g., 5%, 3%).
  3. Population Size: The total number of individuals or units in the population you’re studying.
  4. Variability: The degree to which data points differ from each other. Higher variability requires larger samples.
  5. Effect Size: The magnitude of the difference or relationship you’re trying to detect. Larger effect sizes require smaller samples.
  6. Statistical Test: Different tests have different sample size requirements.
  7. Practical Constraints: Time, cost, and feasibility of data collection.

Steps to Determine Sample Size:

  1. Define Your Research Question and Hypothesis: Clearly articulate what you’re trying to investigate.
  2. Specify Confidence Level and Margin of Error: Choose levels that align with your research goals and tolerance for uncertainty.
  3. Estimate Population Variability: Use prior research, pilot studies, or expert knowledge to estimate variability.
  4. Consider Effect Size: If possible, estimate the expected effect size to refine sample size calculations.
  5. Choose a Statistical Test: Select the appropriate test based on your research question and data type.
  6. Use a Sample Size Calculator: Online calculators or statistical software can perform calculations based on input parameters.
  7. Consult with a Statistician: For complex studies or specific needs, seek guidance from a statistician.

Additional Considerations:

  • Ethical Issues: Balance statistical needs with potential burdens on participants.
  • Non-Response: Anticipate potential non-response and adjust sample size accordingly.
  • Subgroup Analysis: If planning to analyze subgroups, ensure adequate sample sizes within each group.
  • Power Analysis: Conduct a power analysis to determine the probability of detecting a true effect, given your sample size and other study parameters.

Best Practices:

  • Plan Early: Incorporate sample size determination into the early stages of study design.
  • Consider Multiple Factors: Base sample size decisions on a comprehensive assessment of factors.
  • Document Decisions: Clearly report sample size calculations and rationale in research publications.
  • Seek Expert Advice: Consult statisticians or other experts for guidance, especially for complex studies.

Remember: Determining the best sample size is a critical step in ensuring valid and meaningful results in data analysis. By carefully considering the factors involved and following best practices, you can make informed decisions that lead to accurate and reliable conclusions.

Great to see you again. In this video, we’ll
go into more detail about sample sizes
and data integrity. If you’ve ever been to a
store that hands out samples, you know it’s one of
life’s little pleasures. For me, anyway! Those small samples are
also a very smart way for businesses to learn more
about their products from customers without having to
give everyone a free sample. A lot of organizations use
sample size in a similar way. They take one part
of something larger. In this case, a sample
of a population. Sometimes they’ll
perform complex tests on their data to see if it meets
their business objectives. We won’t go into all
the calculations needed to do this effectively. Instead, we’ll focus
on a “big picture” look at the process
and what it involves. As a quick reminder, sample size is a part of a population that is
representative of the population. For businesses, it’s a
very important tool. It can be both expensive and time-consuming to analyze an
entire population of data. Using sample size usually makes the most sense and can still lead to valid and
useful findings. There are handy calculators online that can help
you find sample size. You need to input the
confidence level, population size, and
margin of error. We’ve talked about
population size before. To build on that,
we’ll learn about confidence level and
margin of error. Knowing about these
concepts will help you understand why you need them
to calculate sample size. The confidence level is
the probability that your sample accurately reflects
the greater population. You can think of it the same way as confidence
in anything else. It’s how strongly
you feel that you can rely on something or someone. Having a 99 percent
confidence level is ideal. But most industries
hope for at least a 90 or 95 percent
confidence level. Industries like pharmaceuticals usually want a confidence level that’s as high as possible when they are using a sample size. This makes sense because they’re testing
medicines and need to be sure they work and are
safe for everyone to use. For other studies, organizations might
just need to know that the test or survey results have them heading in
the right direction. For example, if a paint company is testing out new colors, a lower confidence level is okay. You also want to consider the margin of error
for your study. You’ll learn more
about this soon, but it basically tells you how close your sample size
results are to what your results would be if you use the entire population that
your sample size represents. Think of it like this. Let’s say that the principal of a middle school approaches you with a study about
students’ candy preferences. They need to know an appropriate sample size, and they need it now. The school has a student
population of 500, and they’re asking for
a confidence level of 95 percent and a margin
of error of 5 percent. We’ve set up a calculator
in a spreadsheet, but you can also easily find
this type of calculator by searching “sample size
calculator” on the internet. Just like those calculators, our spreadsheet calculator
doesn’t show any of the more complex calculations for figuring out sample size. All we need to do is input the numbers for our population, confidence level,
and margin of error. And when we type 500 for
our population size, 95 for our confidence
level percentage, 5 for our margin
of error percentage, the result is about 218. That means for this study, an appropriate sample
size would be 218. If we surveyed 218 students and found that 55 percent of
them preferred chocolate, then we could be
pretty confident that would be true of
all 500 students. 218 is the minimum number of people we need
to survey based on our criteria of a 95
percent confidence level and a 5 percent margin of error. In case you’re wondering, the confidence level
and margin of error don’t have to add
up to 100 percent. They’re independent
of each other. So let’s say we change
our margin of error from 5 percent
to 3 percent. Then we find that our sample size would need to be larger, about 341 instead of 218, to make the results of the study more representative
of the population. Feel free to practice with
an online calculator. Knowing sample size and how to find it will help you
when you work with data. We’ve got more useful
knowledge coming your way, including learning about
margin of error. See you soon!
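
The video treats the calculator as a black box, but many online sample size calculators use the standard formula for estimating a proportion together with a finite-population correction. Under that assumption, the sketch below reproduces the video's figures of 218 (5% margin of error) and 341 (3% margin of error) for a population of 500 at 95% confidence.

```python
import math

def sample_size(population, confidence, margin_of_error, p=0.5):
    """Minimum sample size for estimating a proportion, using the common z-score
    formula with a finite-population correction. Assumes p = 0.5, the most
    conservative choice, unless told otherwise."""
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[confidence]
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2  # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)                # finite-population correction
    return math.ceil(n)

print(sample_size(500, 0.95, 0.05))  # 218, matching the video's example
print(sample_size(500, 0.95, 0.03))  # 341, matching the tighter margin of error
```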

Reading: Sample size calculator


Practice Quiz: Test your knowledge on testing your data

A research team runs an experiment to determine if a new security system is more effective than the previous version. What type of results are required for the experiment to be statistically significant?

In order to have a high confidence level in a customer survey, what should the sample size accurately reflect?

A data analyst determines an appropriate sample size for a survey. They can check their work by making sure the confidence level percentage plus the margin of error percentage add up to 100%.

Consider the margin of error


Video: Evaluate the reliability of your data

Key points:

  • Definition: Margin of error is the maximum expected difference between sample results and the actual population.
  • Importance: Helps understand the reliability of sample-based data and its relevance to the entire population.
  • Calculation: Based on population size, sample size, and confidence level (e.g., 95%). A larger sample yields a smaller margin of error; a higher confidence level (with the same sample) widens it.
  • Example: Survey about a 4-day workweek with 60% approval and 10% margin of error suggests actual population support between 50% and 70%.
  • Interpretation: Smaller margin of error (e.g., 5%) indicates more reliable results and confidence in generalizing to the population.
  • Resources: Online calculators and spreadsheets can help calculate margin of error based on your data.

Overall:

Understanding margin of error is crucial for data analysts to assess the accuracy and generalizability of their results. By calculating and interpreting the margin of error, they can make more informed conclusions and avoid misleading implications based on small-scale samples.

Additional notes:

  • The video emphasizes the connection between sample size, confidence level, and margin of error, showing how changing one affects the others.
  • Data integrity and alignment with objectives are highlighted as essential elements for reliable analysis.

This summary provides a concise overview of the video’s key points and takeaways. Remember, you can always revisit the video or the glossary for further details and exploration.

Introduction

In data analysis, the reliability of your data is paramount to producing accurate and trustworthy insights. It’s essential to assess the quality and trustworthiness of your data before embarking on any analysis to ensure your findings are valid and meaningful. This tutorial will guide you through the key steps and considerations involved in evaluating data reliability.

Key Considerations

1. Data Source and Collection Methods:

  • Credibility: Assess the reputation and expertise of the data source.
  • Transparency: Understand the data collection methods and any potential biases or limitations.
  • Consistency: Ensure data collection procedures were consistent across time and participants.

2. Data Completeness and Accuracy:

  • Missing Values: Identify and address missing data through imputation or exclusion, depending on the analysis.
  • Errors: Check for inconsistencies, outliers, or errors in the data, and rectify them as needed.
  • Validation: Compare data against external sources or expert knowledge to verify its accuracy.

3. Sampling Methods:

  • Representation: Evaluate whether the sample is representative of the population you’re studying.
  • Randomness: Ensure the sample was selected randomly to avoid selection bias.
  • Size: Consider the sample size and its impact on statistical power and margin of error.

4. Data Integrity:

  • Security: Verify that data has been protected from unauthorized access or tampering.
  • Consistency: Ensure data formatting and definitions are consistent across different sources.
  • Documentation: Review any available documentation about data collection, cleaning, and storage procedures.

5. Statistical Measures:

  • Margin of Error: Calculate the margin of error to understand the potential range of values in the population.
  • Confidence Intervals: Use confidence intervals to express the uncertainty associated with estimates.
  • Statistical Tests: Apply appropriate statistical tests to assess the significance of findings and relationships within the data.
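
To make a few of the checks above more concrete, here is a small, hedged sketch in Python using pandas on a made-up survey table; the column names and values are purely illustrative, and you would adapt the checks to your own dataset.

import pandas as pd

# A tiny, fabricated survey table used only to demonstrate the checks.
df = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3, 4, 5],
    "age": [34, 29, 29, None, 41, 230],  # one missing value, one implausible value
    "prefers_4day_week": ["yes", "no", "no", "yes", "yes", "no"],
})

# Completeness (section 2): count missing values per column.
print(df.isna().sum())

# Consistency (section 4): flag duplicate respondents.
print(df[df.duplicated(subset="respondent_id", keep=False)])

# Accuracy (section 2): flag values outside a plausible range.
print(df[(df["age"] < 0) | (df["age"] > 120)])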

Additional Considerations:

  • Data Bias: Be mindful of potential biases in data collection or analysis, such as selection bias, measurement bias, or confirmation bias.
  • Data Alignment: Ensure the data aligns with your research objectives and questions.
  • External Validation: Consider validating findings with external data sources or expert opinions.

Conclusion

Evaluating data reliability is an essential step in any data analysis project. By carefully considering these factors, you can build confidence in your data and ensure the validity of your conclusions, leading to more informed decision-making and meaningful insights.

Hey there! Earlier, we touched on margin of
error without explaining it completely. Well, we’re going to right that wrong
in this video by explaining margin of error more. We’ll even include an example
of how to calculate it. As a data analyst, it’s important for you
to figure out sample size and variables like confidence level and margin of error
before running any kind of test or survey. It’s the best way to make sure
your results are objective, and it gives you a better chance of getting
statistically significant results. But if you already know the sample size,
like when you’re given survey results to analyze, you can calculate
the margin of error yourself. Then you’ll have a better idea of how much of
a difference there is between your sample and your population. We’ll start at the beginning
with a more complete definition. Margin of error is the maximum that
the sample results are expected to differ from those of
the actual population. Let’s think about an example
of margin of error. It would be great to survey or
test an entire population, but it’s usually impossible or
impractical to do this. So instead, we take a sample
of the larger population. Based on the sample size, the resulting margin of error will tell
us how different the results might be compared to the results if we had
surveyed the entire population. Margin of error helps you understand how
reliable the data from your hypothesis testing is. The closer to zero the margin of error,
the closer your results from your sample would match results
from the overall population. For example, let’s say you completed a nationwide
survey using a sample of the population. You asked people who work five-day
workweeks whether they like the idea of a four-day workweek. So your survey tells you that
60% prefer a four-day workweek. The margin of error was 10%, which tells us that between
50 and 70% like the idea. So if we were to survey all
five-day workers nationwide, between 50 and 70% would
agree with our results. Keep in mind that our range
is between 50 and 70%. That’s because the margin of error
is counted in both directions from the survey results of 60%. If you set up a 95% confidence
level for your survey, there’ll be a 95% chance that the
entire population’s responses will fall between 50 and 70% saying, yes,
they want a four-day workweek. Since your margin of error overlaps
with that 50% mark, you can’t say for sure that the public likes
the idea of a four-day workweek. In that case, you’d have to say
your survey was inconclusive. Now, if you wanted a lower
margin of error, say 5%, with a range between 55 and 65%,
you could increase the sample size. But if you’ve already been
given the sample size, you can calculate the margin
of error yourself. Then you can decide yourself how
much of a chance your results have of being statistically significant
based on your margin of error. In general, the more people
you include in your survey, the more likely your sample is
representative of the entire population. Decreasing the confidence level
would also have the same effect, but that would also make it less
likely that your survey is accurate. So to calculate margin of
error, you need three things: population size, sample size,
and confidence level. And just like with sample size, you can find lots of calculators online by
searching “margin of error calculator.” But we’ll show you in a spreadsheet, just like we did when we
calculated sample size. Lets say you’re running a study on
the effectiveness of a new drug. You have a sample size
of 500 participants whose condition affects 1%
of the world’s population. That’s about 80 million people,
which is the population for your study. Since it’s a drug study, you need
to have a confidence level of 99%. You also need a low margin of error. Let's calculate it. We'll put the numbers for population, confidence level, and sample size in the appropriate
spreadsheet cells. And our result is a margin of error
of close to 6%, plus or minus. When the drug study is complete, you’d
apply the margin of error to your results to determine how reliable
your results might be. A calculator like this one in the
spreadsheet is just one of the many tools you can use to ensure data integrity. And it's also good to remember that
checking for data integrity and aligning the data with your objectives will put you
in good shape to complete your analysis. Knowing about sample size,
statistical power, margin of error, and other topics we’ve covered
will help your analysis run smoothly. That’s a lot of new concepts to take in. If you’d like to review them at any time, you can find them all in the glossary,
or feel free to rewatch the video! Soon you’ll explore the ins and
outs of clean data. The data adventure keeps moving! I’m so glad you’re moving along with it. You got this!
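
If you'd like to check the drug study numbers yourself, here is a minimal Python sketch, assuming the common formula margin of error = z * sqrt(p * (1 - p) / n) with a finite population correction; the z-score of 2.576 corresponds to a 99 percent confidence level, and the function name is just for illustration.

import math

def margin_of_error(population, sample_size, z_score=2.576, proportion=0.5):
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction; negligible when the population is huge.
    fpc = math.sqrt((population - sample_size) / (population - 1))
    return z_score * standard_error * fpc

moe = margin_of_error(population=80_000_000, sample_size=500)
print(f"Margin of error: plus or minus {moe:.1%}")  # roughly 5.8%, close to 6%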

Reading: All about margin of error


Practice Quiz: Test your knowledge on margin of error

Fill in the blank: Margin of error is the _____ amount that the sample results are expected to differ from those of the actual population.

In a survey about a new cleaning product, 75% of respondents report they would buy the product again. The margin of error for the survey is 5%. Based on the margin of error, what percentage range reflects the population’s true response?

Module 1 challenge


Reading: Glossary: Terms and definitions

Quiz: Module 1 challenge

Fill in the blank: If a data analyst is using data that has been _____, the data will lack integrity and the analysis will be faulty.

A healthcare company keeps copies of their data at several locations across the country. The data becomes compromised because each location creates a copy of the original at different times of day. Which of the following processes caused the compromise?

A data analyst is given a dataset for analysis. It includes data about the total population of every country in the previous 20 years. Based on the available data, an analyst would be able to determine the reasons behind a certain country’s population increase from 2016 to 2017.

A data analyst is given a dataset for analysis. To use the template for this dataset, click the link below and select “Use Template.”
Link to template: June 2014 Invoices

A data analyst is working on a project about the global supply chain. They have a dataset with lots of relevant data from Europe and Asia. However, they decide to generate new data that represents all continents. What type of insufficient data does this scenario describe?

When gathering data through a survey, companies can save money by surveying 100% of a population.

A restaurant wants to gather data about a new dish by giving out free samples and asking for feedback. Who should the restaurant give samples to?

Which of the following processes helps ensure a close alignment of data and business objectives?