As you start thinking about how to prepare your data for exploration, this part of the course will highlight why data integrity is so essential to successful decision-making. You’ll learn about how data is generated and the techniques analysts use to decide what data to collect for analysis. And you’ll discover structured and unstructured data, data types, and data formats.
Learning Objectives
- Describe statistical measures associated with data integrity including statistical power, hypothesis testing, and margin of error
- Describe strategies that can be used to address insufficient data
- Discuss the importance of sample size with reference to sample bias and random samples
- Describe the relationship between data and related business objectives
- Define data integrity with reference to types and risks
- Discuss the importance of pre-cleaning activities
Focus on integrity
Video: Introduction to focus on integrity
Key Points:
- Clean data is crucial for accurate analysis and decision-making.
- We’ll explore data integrity, sampling techniques, data testing, and cleaning methods.
- Learn to clean data in spreadsheets, databases, and with SQL.
- Verify and report your cleaning results for transparency and future reference.
- Clean data skills are a valuable asset for any data analyst resume.
Hi! Good to see you! My name is Sally, and I’m here to
teach you all about processing data. I’m a measurement and
analytical lead at Google. My job is to help advertising agencies and
companies measure success and analyze their data, so I get to meet with lots of
different people to show them how data analysis helps
with their advertising. Speaking of analysis, you did great
earlier learning how to gather and organize data for analysis. It’s definitely an important step in
the data analysis process, so well done! Now let’s talk about how to make sure
that your organized data is complete and accurate. Clean data is the key to making sure your
data has integrity before you analyze it. We’ll show you how to make sure
your data is clean and tidy. Cleaning and processing data is one part
of the overall data analysis process. As a quick reminder,
that process is Ask, Prepare, Process, Analyze, Share, and Act. Which means it’s time for
us to explore the Process phase, and I’m here to guide you the whole way. I’m very familiar with
where you are right now. I’d never heard of data analytics until
I went through a program similar to this one. Once I started making progress, I realized
how much I enjoyed data analytics and the doors it could open. And now I’m excited to help
you open those same doors! One thing I realized as I worked for different companies, is that clean
data is important in every industry. For example, I learned early in my career
to be on the lookout for duplicate data, a common problem that analysts
come across when cleaning. I used to work for a company that had
different types of subscriptions. In our data set, each user would have
a new row for each subscription type they bought, which meant users would
show up more than once in my data. So if I had counted the number of users
in a table without accounting for duplicates like this, I would have
counted some users twice instead of once. As a result, my analysis would have been
wrong, which would have led to problems in my reports and for the stakeholders
relying on my analysis. Imagine if I told the CEO that we
had twice as many customers as we actually did!? That’s why clean data is so important. So the first step in processing data
is learning about data integrity. You will find out what
data integrity is and why it is important to maintain it
throughout the data analysis process. Sometimes you might not even
have the data that you need, so you’ll have to create it yourself. This will help you learn how sample size
and random sampling can save you time and effort. Testing data is another important
step to take when processing data. We’ll share some guidance on how to
test data before your analysis officially begins. Just like you’d clean your clothes and
your dishes in everyday life, analysts clean their data all the time,
too. The importance of clean data
will definitely be a focus here. You’ll learn data cleaning techniques for
all scenarios, along with some pitfalls to watch out for
as you clean. You’ll explore data cleaning in
both spreadsheets and databases, building on what you’ve already
learned about spreadsheets. We’ll talk more about SQL and
how you can use it to clean data and do other useful things, too. When analysts clean their data, they do
a lot more than a spot check to make sure it was done correctly. You’ll learn ways to verify and
report your cleaning results. This includes documenting your
cleaning process, which has lots of benefits that we’ll explore. It’s important to remember that processing
data is just one of the tasks you’ll complete as a data analyst. Actually, your skills with cleaning data
might just end up being something you highlight on your resume
when you start job hunting. Speaking of resumes, you’ll be able
to start thinking about how to build your own from the perspective
of a data analyst. Once you’re done here, you’ll have
a strong appreciation for clean data and how important it is in
the data analysis process. So let’s get started!
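Sally's duplicate-subscription story can be sketched with a tiny SQL example. This is a minimal illustration (the table and column names are made up): each subscription a user buys adds a row, so counting rows overstates the number of users, while `COUNT(DISTINCT ...)` counts each user once.

```python
import sqlite3

# Hypothetical subscriptions table: one row per subscription purchased,
# so a user with multiple subscriptions appears in multiple rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (user_id TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?)",
    [("u1", "basic"), ("u1", "premium"),  # same user, two plans
     ("u2", "basic"), ("u3", "premium")],
)

# Naive count: treats every row as a user, double-counting u1.
rows = conn.execute("SELECT COUNT(*) FROM subscriptions").fetchone()[0]

# Correct count: each user counted once, however many subscriptions they have.
users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM subscriptions"
).fetchone()[0]

print(rows, users)  # 4 rows, but only 3 distinct users
```

Reporting `rows` here is exactly the mistake from the transcript: the CEO would hear about 4 customers when there are really 3.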
Data integrity and analytics objectives
Video: Why data integrity is important
Key Points:
- Clean data, strong analysis: Data integrity (accuracy, completeness, consistency, trustworthiness) is crucial for reliable results.
- Risks to integrity: Replication, transfer, manipulation, human error, viruses, and system failures can all compromise data.
- Good news: In many companies, data warehouses or engineers handle data integrity.
- Analyst’s role: Double-check data completeness and validity before analysis for accurate conclusions.
Remember: Checking data integrity is a vital step to avoid basing your analysis on shaky ground. Stay tuned for deeper insights!
Data is the lifeblood of modern decision-making. From businesses to healthcare, every field relies on accurate and reliable data to make informed choices. But what happens when the data itself is flawed? That’s where data integrity comes in.
What is Data Integrity?
Data integrity refers to the accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle. In simpler terms, it’s about ensuring your data is reliable and reflects reality. Think of it like building a house: if the foundation (data) is shaky, the whole structure (analysis and decisions) will be unstable.
Why is Data Integrity Important?
The consequences of compromised data can be far-reaching:
- Misleading decisions: Imagine basing a marketing campaign on inaccurate customer data. You might target the wrong audience, wasting resources and damaging your brand.
- Financial losses: Incorrect inventory data could lead to overstocking or understocking, resulting in financial losses.
- Ethical concerns: In healthcare, faulty data could lead to misdiagnoses or improper treatment, impacting patient well-being.
Common Data Integrity Threats:
Several factors can compromise data integrity:
- Human error: Typos, data entry mistakes, and incorrect calculations can all introduce errors.
- Technical issues: Software bugs, hardware failures, and network problems can corrupt data.
- Security breaches: Cyberattacks and unauthorized access can manipulate or steal data.
- Data manipulation: Intentionally altering data for personal gain or to skew results can have severe consequences.
Maintaining Data Integrity:
Fortunately, there are ways to safeguard data integrity:
- Data validation: Implement checks to ensure data accuracy and completeness at every stage.
- Data cleaning: Identify and remove errors or inconsistencies in existing data.
- Data backup and recovery: Regularly back up data to minimize the impact of data loss.
- Security measures: Implement robust security protocols to protect data from unauthorized access and manipulation.
- Documentation: Document data collection, storage, and processing procedures to ensure transparency and auditability.
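The first two safeguards above (validation and cleaning) can be sketched as a simple per-record check that runs before data enters analysis. The field names and rules here are hypothetical, for illustration only:

```python
def validate_record(record):
    """Return a list of integrity problems found in one record (empty = clean)."""
    problems = []
    # Completeness: mandatory fields must be present and non-empty.
    for field in ("customer_id", "order_date", "amount"):
        if not record.get(field):
            problems.append(f"missing {field}")
    # Accuracy: amounts must be non-negative numbers.
    amount = record.get("amount")
    if amount is not None and (not isinstance(amount, (int, float)) or amount < 0):
        problems.append("invalid amount")
    return problems

clean = {"customer_id": "C1", "order_date": "2020-10-12", "amount": 19.99}
dirty = {"customer_id": "", "order_date": "2020-10-12", "amount": -5}

print(validate_record(clean))  # []
print(validate_record(dirty))  # ['missing customer_id', 'invalid amount']
```

Records that come back with a non-empty problem list would be routed to cleaning rather than straight into analysis.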
Investing in Data Integrity:
While maintaining data integrity requires effort, the benefits are substantial:
- Improved decision-making: Reliable data leads to better-informed choices, enhancing business processes and outcomes.
- Enhanced efficiency: Clean and consistent data streamlines analysis, saving time and resources.
- Increased trust: By ensuring data reliability, you build trust with stakeholders and customers.
Conclusion:
Data integrity is not optional; it’s a necessity in today’s data-driven world. By proactively safeguarding data integrity, you can ensure your analysis is based on solid ground, leading to trustworthy insights and informed decisions. Remember, your data is only as good as its integrity, so treat it with the care it deserves!
Bonus: You can explore further resources like specific data validation techniques, data cleaning tools, and best practices for secure data storage to learn more about practical implementation.
Welcome back. In this video, we’re going to discuss
data integrity and some risks you might run
into as a data analyst. A strong analysis depends on
the integrity of the data. If the data you’re using
is compromised in any way, your analysis won’t be as
strong as it should be. Data integrity is the
accuracy, completeness, consistency, and
trustworthiness of data throughout its lifecycle. That might sound like a lot of qualities for the
data to live up to. But trust me, it’s worth
it to check for them all before proceeding
with your analysis. Otherwise, your analysis
could be wrong. Not because you did
something wrong, but because the data
you were working with was wrong to begin with. When data integrity is low, it can cause anything from
the loss of a single pixel in an image to an incorrect
medical decision. In some cases, one missing piece can make
all of your data useless. Data integrity can be compromised in lots
of different ways. There’s a chance data can be compromised every
time it’s replicated, transferred, or
manipulated in any way. Data replication
is the process of storing data in
multiple locations. If you’re replicating data at different times in
different places, there’s a chance your
data will be out of sync. This data lacks integrity because different
people might not be using the same data for their findings, which can
cause inconsistencies. There’s also the issue
of data transfer, which is the process of copying data from a storage device to memory, or from one
computer to another. If your data transfer
is interrupted, you might end up with
an incomplete data set, which might not be
useful for your needs. The data manipulation
process involves changing the data to make it more
organized and easier to read. Data manipulation
is meant to make the data analysis
process more efficient, but an error during the process can compromise the efficiency. Finally, data can
also be compromised through human error, viruses, malware, hacking,
and system failures, which can all lead to
even more headaches. I’ll stop there. That’s enough potentially
bad news to digest. Let’s move on to some
potentially good news. In a lot of companies,
the data warehouse or data engineering team takes care of ensuring data integrity. Coming up, we’ll
learn about checking data integrity as a data analyst. But rest assured, someone else will usually
have your back too. After you’ve found out what
data you’re working with, it’s important to double-check
that your data is complete and valid
before analysis. This will help ensure
that your analysis and eventual conclusions
are accurate. Checking data integrity
is a vital step in processing your data to
get it ready for analysis, whether you or someone else
at your company is doing it. Coming up, you’ll learn even more about data integrity.
See you soon!
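One common way to catch the replication and transfer problems described in the video is to compare checksums of the source data and the copy: identical bytes produce identical digests, so a mismatch flags an out-of-sync replica or an interrupted transfer. A minimal sketch (the sample data is invented):

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return a SHA-256 digest; identical data yields identical digests."""
    return hashlib.sha256(data).hexdigest()

source = b"user_id,amount\nu1,40\nu2,30\n"
complete_copy = b"user_id,amount\nu1,40\nu2,30\n"
truncated_copy = b"user_id,amount\nu1,40\n"  # e.g., an interrupted transfer

print(checksum(source) == checksum(complete_copy))   # True: copy is intact
print(checksum(source) == checksum(truncated_copy))  # False: flag for re-copy
```

In practice this kind of check usually lives with the data engineering team, but it shows why an analyst can often trust that replicated data has been verified.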
Reading: More about data integrity and compliance
Reading
This reading illustrates the importance of data integrity using an example of a global company’s data. Definitions of terms that are relevant to data integrity will be provided at the end.
Scenario: calendar dates for a global company
Calendar dates are represented in a lot of different short forms. Depending on where you live, a different format might be used.
- In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
- In other countries, the national standard is YYYY-MM-DD so October 12, 2020 becomes 2020-10-12.
- In the United States, (MM/DD/YY) is the accepted format so October 12, 2020 is going to be 10/12/20.
Now, think about what would happen if you were working as a data analyst for a global company and didn’t check date formats. Well, your data integrity would probably be questionable. Any analysis of the data would be inaccurate. Imagine ordering extra inventory for December when it was actually needed in October!
A good analysis depends on the integrity of the data, and data integrity usually depends on using a common format. So it is important to double-check how dates are formatted to make sure what you think is December 10, 2020 isn’t really October 12, 2020, and vice versa.
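The ambiguity can be made concrete with Python's `datetime`: the same string "12/10/20" yields two different calendar dates depending on which format you assume, which is why normalizing everything to one standard (such as YYYY-MM-DD) matters:

```python
from datetime import datetime

raw = "12/10/20"

# The same string, under two different format assumptions:
as_ddmmyy = datetime.strptime(raw, "%d/%m/%y")  # October 12, 2020
as_mmddyy = datetime.strptime(raw, "%m/%d/%y")  # December 10, 2020

# Normalizing to the ISO standard removes the ambiguity.
print(as_ddmmyy.strftime("%Y-%m-%d"))  # 2020-10-12
print(as_mmddyy.strftime("%Y-%m-%d"))  # 2020-12-10
```

Once every date is stored as an unambiguous ISO string, analysts in any country read the same value the same way.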
Here are some other things to watch out for:
- Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren’t verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.
- Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity.
- Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
Conclusion
Fortunately, with a standard date format and compliance by all people and systems that work with the data, data integrity can be maintained. But no matter where your data comes from, always be sure to check that it is valid, complete, and clean before you begin any analysis.
Reference: Data constraints and examples
As you progress in your data journey, you’ll come across many types of data constraints (or criteria that determine validity). The table below offers definitions and examples of data constraint terms you might come across.
| Data constraint | Definition | Examples |
| --- | --- | --- |
| Data type | Values must be of a certain type: date, number, percentage, Boolean, etc. | If the data type is a date, a single number like 30 would fail the constraint and be invalid |
| Data range | Values must fall between predefined maximum and minimum values | If the data range is 10-20, a value of 30 would fail the constraint and be invalid |
| Mandatory | Values can’t be left blank or empty | If age is mandatory, that value must be filled in |
| Unique | Values can’t have a duplicate | Two people can’t have the same mobile phone number within the same service area |
| Regular expression (regex) patterns | Values must match a prescribed pattern | A phone number must match ###-###-#### (no other characters allowed) |
| Cross-field validation | Certain conditions for multiple fields must be satisfied | Values are percentages and values from multiple fields must add up to 100% |
| Primary-key | (Databases only) value must be unique per column | A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program. |
| Set-membership | (Databases only) values for a column must come from a set of discrete values | Value for a column must be set to Yes, No, or Not Applicable |
| Foreign-key | (Databases only) values for a column must be unique values coming from a column in another table | In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table |
| Accuracy | The degree to which the data conforms to the actual entity being measured or described | If values for zip codes are validated by street location, the accuracy of the data goes up |
| Completeness | The degree to which the data contains all desired components or measures | If data for personal profiles required hair and eye color, and both are collected, the data is complete |
| Consistency | The degree to which the data is repeatable from different points of entry or collection | If a customer has the same address in the sales and repair databases, the data is consistent |
Video: Balancing objectives with data integrity
Key Points:
- Data integrity is crucial, but aligning data with business objectives adds another layer of complexity.
- Matching data tables to specific questions is straightforward, but limitations like unclean data, duplicates, and insufficient data can affect analysis.
- Incomplete data is like an unclear picture: you won’t see the whole story. Relying on data just because it arrives in neat rows and columns can be misleading.
- Real-life example: an analyst navigates limited delivery data by collaborating with engineers to improve tracking, achieving faster deliveries and happier customers.
- Learning to handle data limitations while aiming for objectives is key to a successful data analyst career.
Next: Deeper dive into aligning data with objectives.
Bonus: The “picture” analogy vividly portrays the importance of complete data for accurate analysis.
Welcome, data enthusiasts! Today, we tackle a crucial tightrope walk: balancing your business objectives with the unwavering need for data integrity. Like a skilled acrobat, we’ll navigate the delicate equilibrium between getting answers and maintaining trustworthy results.
Why the Balancing Act?
Imagine this: You’re tasked with boosting e-commerce sales. You dive into analytics, but the data’s riddled with missing values and duplicate entries. What do you do? Rush an analysis based on shaky ground, potentially misleading the company, or delay insights while meticulously cleaning the data, risking missed sales opportunities?
The Balancing Equation:
Here’s the key: achieving objectives without compromising data integrity. It’s a constant negotiation, and mastering it sets you apart as a top-notch data analyst.
Strategies for Success:
- Clarity of Objectives: Define your goals precisely. Are you aiming for long-term trends or short-term insights? This guides your data selection and analysis methods.
- Know Your Data: Assess your data’s strengths and weaknesses. What limitations exist? Incomplete records? Biases? Understanding these limitations informs your analysis and interpretation.
- Transparency is Key: Communicate data limitations upfront. Don’t shy away from caveats and uncertainties. Clear communication builds trust and avoids misinterpretations.
- Embrace Iterations: Analyze, assess, adapt. Don’t see data cleaning as a one-time chore. Be prepared to refine your data and analysis as needed to reach trustworthy conclusions.
- Alternative Data Sources: Sometimes, your existing data might not be the answer. Explore alternative sources that, when combined with your current data, can fill in gaps and paint a clearer picture.
- Collaborate with Data Warriors: Seek input from data engineers and specialists. Their expertise can help optimize data cleaning and access specialized tools.
Real-World Example:
Remember the e-commerce scenario? Instead of rushing an analysis with dirty data, you collaborate with the data team to clean and deduplicate records. You discover a segment of loyal customers responsible for a significant portion of sales. Armed with this insight, you propose targeted marketing campaigns, boosting sales without compromising data integrity.
Remember: Balancing objectives with data integrity is a continuous dance. But with the right tools and mindset, you can become a data acrobat, weaving insights from even the messiest data sets, while keeping your footing on the solid ground of data integrity.
Bonus Tips:
- Document your data limitations and cleaning procedures for future reference and transparency.
- Stay updated on data quality tools and techniques to continuously improve your data skills.
- Practice communicating data limitations and uncertainties effectively to stakeholders.
So, step onto the tightrope, fellow data analysts! With confidence, collaboration, and a relentless pursuit of trustworthy insights, you can master the art of balancing objectives with data integrity.
Hey there, it’s good to remember
to check for data integrity. It’s also important to check that
the data you use aligns with the business objective. This adds another layer to
the maintenance of data integrity because the data you’re using might have
limitations that you’ll need to deal with. The process of matching data to business
objectives can actually be pretty straightforward. Here’s a quick example.
Let’s say you’re an analyst for a business that produces and
sells auto parts. If you need to address a question about
the revenue generated by the sale of a certain part, then you’d pull up
the revenue table from the data set. If the question is about customer reviews, then you’d pull up the reviews table
to analyze the average ratings. But before digging into any analysis, you need to consider a few
limitations that might affect it. If the data hasn’t been cleaned properly,
then you won’t be able to use it yet. You would need to wait until
a thorough cleaning has been done. Now, let’s say you’re trying to find
how much an average customer spends. You notice the same customer’s data
showing up in more than one row. This is called duplicate data. To fix this, you might need to
change the format of the data, or you might need to change the way
you calculate the average. Otherwise, it will seem like the data
is for two different people, and you’ll be stuck with
misleading calculations. You might also realize there’s not enough
data to complete an accurate analysis. Maybe you only have a couple
of months’ worth of sales data. There’s a slim chance you could
wait for more data, but it’s more likely that you’ll
have to change your process or find alternate sources of data
while still meeting your objective. I like to think of
a data set like a picture. Take this picture. What are we looking at? Unless you’re an expert traveler
or know the area, it may be hard to pick out
from just these two images. Visually, it’s very clear when we
aren’t seeing the whole picture. When you get the complete picture,
you realize… you’re in London! With incomplete data, it’s hard to see the whole picture to
get a real sense of what is going on. We sometimes trust data because if it
comes to us in rows and columns, it seems like everything we need is there if we
just query it. But that’s just not true. I remember a time when I found
out I didn’t have enough data and had to find a solution. I was working for
an online retail company and was asked to figure out how to shorten
customer purchase to delivery time. Faster delivery times usually
lead to happier customers. When I checked the data set,
I found very limited tracking information. We were missing some pretty key details. So the data engineers and I created new
processes to track additional information, like the number of stops in a journey. Using this data, we reduced the time
it took from purchase to delivery and saw an improvement in customer
satisfaction. That felt pretty great! Learning how to deal with data issues
while staying focused on your objective will help set you up for success in
your career as a data analyst. And your path to success continues. Next step, you’ll learn more about
aligning data to objectives. Keep it up!
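The duplicate-customer pitfall from the video can be shown numerically: averaging over rows treats one customer's two purchases as two different people, while grouping by customer first gives the average spend per customer. The sample data is made up:

```python
from collections import defaultdict
from statistics import mean

# One row per purchase; customer "anna" appears in two rows.
rows = [("anna", 40.0), ("anna", 60.0), ("ben", 30.0)]

# Misleading: averages over rows, so anna is effectively counted twice.
per_row_avg = mean(amount for _, amount in rows)  # (40 + 60 + 30) / 3

# Better for "average customer spend": total per customer, then average.
totals = defaultdict(float)
for customer, amount in rows:
    totals[customer] += amount
per_customer_avg = mean(totals.values())  # (100 + 30) / 2

print(per_row_avg, per_customer_avg)  # about 43.33 vs 65.0
```

Which average is "right" depends on the business question, which is exactly why the data has to be matched to the objective before calculating.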
Reading: Well-aligned objectives and data
Practice Quiz: Test your knowledge on data integrity and analytics objectives
Which of the following principles are key elements of data integrity? Select all that apply.
Accuracy, Trustworthiness, Consistency
Data integrity is the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle.
Which process do data analysts use to make data more organized and easier to read?
Data manipulation
To make data more organized and easier to read, data analysts use data manipulation.
Before analysis, a company collects data from countries that use different date formats. Which of the following updates would improve the data integrity?
Change all of the dates to the same format
Changing all of the dates to the same format would improve the data integrity.
Overcoming the challenges of insufficient data
Video: Dealing with insufficient data
Key Points:
- Insufficient data can hinder reliable analysis, even with the vast amount available.
- Setting limits: Define the scope and required data for your analysis beforehand.
- Real-world example: Forecasting support tickets needed a multi-year data window to account for seasonality.
- Common limitations and solutions:
- Limited source: Combine data from multiple sources for broader insights.
- Updating data: Wait for data to settle or adjust analysis objective (e.g., weekly trends instead of monthly).
- Outdated data: Find a newer, relevant data set.
- Geographically-limited data: Use global data for global analyses.
- Strategies: Identify trends within existing data, wait for more data, consult stakeholders, or find new data sources.
- Mastering these skills sets you up for success as a data analyst.
Bonus: Remember, the specific approach depends on your role and industry needs.
Next: Learn about statistical power, another valuable tool for data analysis.
Data Detective: Handling Insufficient Data Like a Pro
Welcome to the world of data analysis, where even an abundance of information doesn’t always guarantee a smooth ride. Sometimes, you’ll face the challenge of insufficient data. But worry not, fellow analyst! I’m here to guide you through the maze of missing data and help you emerge with reliable insights.
Here’s your toolkit for navigating this dilemma:
1. Scope and Limits: Your Analysis Blueprint
- Define your business objective clearly: What questions are you aiming to answer?
- Determine the essential data elements: What information is crucial for drawing meaningful conclusions?
- Set boundaries for your analysis: What timeframe and geographical scope align with your objective?
2. Common Data Limitations and Solutions:
- Limited Source:
- Seek additional sources to broaden your data pool.
- Combine data from different platforms or providers, ensuring compatibility and quality.
- Updating Data:
- If possible, wait for more data to accumulate before conducting analysis.
- Adjust your analysis objective to accommodate the available data (e.g., analyze weekly trends instead of monthly).
- Outdated Data:
- Find a newer, more relevant dataset that reflects current trends.
- Geographically-Limited Data:
- Expand your scope to include a wider geographical range if applicable.
3. Strategies for Action:
- Identify Trends with Available Data:
- Look for patterns and tendencies within the limited dataset, acknowledging any potential biases.
- Wait for More Data (if time allows):
- If feasible, postpone analysis until sufficient data is available.
- Talk with Stakeholders and Adjust Objective:
- Collaborate with stakeholders to explore alternative objectives that can be addressed with the existing data.
- Look for a New Dataset:
- Seek external or internal sources for a dataset that meets your analysis requirements.
4. Transparency and Communication:
- Acknowledge limitations upfront: Be transparent with stakeholders about any constraints in the data.
- Express confidence intervals: State the range within which you’re confident about your results.
- Highlight assumptions and potential biases: Indicate any assumptions made during analysis and potential biases in the data.
Remember:
- Data analysis is an iterative process: Be prepared to adapt your approach as you encounter challenges.
- Consult with experts: Seek guidance from experienced analysts or statisticians when needed.
- Embrace uncertainty: Accept that some level of uncertainty is inherent in data analysis, and strive for the most reliable insights possible.
Practice with these tips, and you’ll become a master of dealing with insufficient data, ensuring the integrity and value of your analysis. Stay curious, stay adaptable, and keep those data detective skills sharp!
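The "identify trends with the available data" strategy can be sketched with a simple least-squares line: given three months of figures, project a best estimate for month four. The attendance numbers are invented, and a real analysis would stress to stakeholders how uncertain an estimate built on three points is:

```python
# Monthly attendance for a new attraction (hypothetical numbers).
months = [1, 2, 3]
attendance = [1200, 1500, 1800]

# Ordinary least-squares fit of attendance = slope * month + intercept.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(attendance) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, attendance)) / \
        sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

# Best estimate for month 4: presented as an estimate, not a certainty,
# given how little data backs it.
month4 = slope * 4 + intercept
print(month4)  # 2100.0
```

Qualifying the result ("our best estimate based on the data we currently have") is part of the deliverable, not an afterthought.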
If you have to complete your analysis with insufficient data, how should you address this limitation?
Identify trends with the available data
If you have insufficient data, you can identify trends with the data that is available and qualify your findings accordingly.
Every analyst has been in a situation
where there is insufficient data to help with their business objective. Considering how much data is generated
every day, it may be hard to believe, but it’s true. So let’s discuss what you can do
when you have insufficient data. We’ll cover how to set limits for
the scope of your analysis and what data you should include. At one point,
I was a data analyst at a support center. Every day, we received customer questions,
which were logged in as support tickets. I was asked to forecast the number of
support tickets coming in per month to figure out how many additional
people we needed to hire. It was very important that we had
sufficient data spanning back at least a couple of years because I had to account
for year-to-year and seasonal changes. If I just had the current
year’s data available, I wouldn’t have known that
a spike in January is common and has to do with people asking for
refunds after the holidays. Because I had sufficient data, I was able to suggest we hire more
people in January to prepare. Challenges are bound to come up, but the good news is that once
you know your business objective, you’ll be able to recognize
whether you have enough data. And if you don’t, you’ll be able to deal
with it before you start your analysis. Now, let’s check out some of those
limitations you might come across and how you can handle different
types of insufficient data. Say you’re working in
the tourism industry, and you need to find out which travel
plans are searched most often. If you only use data
from one booking site, you’re limiting yourself to
data from just one source. Other booking sites might show different
trends that you would want to consider for your analysis. If a limitation like this impacts
your analysis, you can stop and go back to your stakeholders
to figure out a plan. If your data set keeps updating,
that means the data is still incoming and might not be complete. So if there’s a brand new tourist
attraction that you’re analyzing interest and attendance for, there’s probably not
enough data for you to determine trends. For example, you might want to
wait a month to gather data. Or you can check in with the stakeholders
and ask about adjusting the objective. For example, you might analyze trends from
week to week instead of month to month. You could also base your analysis on
trends over the past three months and say, “Here’s what attendance at the attraction
for month four could look like.” You might not have enough data to know
if this number is too low or too high. But you would tell stakeholders that it’s
your best estimate based on the data that you currently have. On the other hand, your data could
be older and no longer be relevant. Outdated data about customer satisfaction
won’t include the most recent responses. So you’ll be relying on the ratings for
hotels or vacation rentals that might
no longer be accurate. In this case, your best bet might be
to find a new data set to work with. Data that's geographically limited
could also be unreliable. If your company is global, you wouldn’t
want to use data limited to travel in just one country. You would want
a data set that includes all countries. So that’s just a few of the most common
limitations you’ll come across and some ways you can address them. You can identify trends with the available
data or wait for more data if time allows; you can talk with stakeholders and
adjust your objective; or you can look for a new data set. The need to take these steps will
depend on your role in your company and possibly the needs of the wider industry. But learning how to deal with insufficient
data is always a great way to set yourself up for success. Your data analyst powers are growing
stronger. And just in time. After you learn more about limitations and
solutions, you’ll learn about statistical power,
another fantastic tool for you to use. See you soon!
Reading: What to do when you find an issue with your data
Reading
When you are getting ready for data analysis, you might realize you don’t have the data you need or you don’t have enough of it. In some cases, you can use what is known as proxy data in place of the real data. Think of it like substituting oil for butter in a recipe when you don’t have butter. In other cases, there is no reasonable substitute and your only option is to collect more data.
Consider the following data issues and suggestions on how to work around them.
Data issue 1: no data
Possible Solutions | Examples of solutions in real life |
---|---|
Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. | If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees. |
If there isn’t time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. | If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic. |
Data issue 2: too little data
Possible Solutions | Examples of solutions in real life |
---|---|
Do the analysis using proxy data along with actual data. | If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors. |
Adjust your analysis to align with the data you already have. | If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only. |
Data issue 3: wrong data, including data with errors*
Possible Solutions | Examples of solutions in real life |
---|---|
If you have the wrong data because requirements were misunderstood, communicate the requirements again. | If you need the data for female voters and received the data for male voters, restate your needs. |
Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. | If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values. |
If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias. | If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data. |
* Important note: sometimes data with errors can be a warning sign that the data isn’t reliable. Use your best judgment.
Use the following decision tree as a reminder of how to deal with data errors or not enough data:
![](https://i0.wp.com/stackfolio.xyz/wp-content/uploads/2024/01/data-errors-1024x708.png?resize=1024%2C708&ssl=1)
Video: The importance of sample size
Key Points:
- Samples vs. Populations: Analyzing populations (all data values) is ideal, but often impractical.
- Sample Size: Using a representative portion of the population to draw conclusions about the whole.
- Benefits: Faster, cheaper, still insightful.
- Trade-off: Uncertainty – conclusions may not perfectly reflect the population.
- Sampling Bias: Unequal representation of population groups can skew results.
- Random Sampling: Selecting samples with equal chance for each member minimizes bias.
- Sample Size Planning: Done before data analysis to ensure representativeness.
Remember: Samples offer powerful insights while saving time and resources. Choosing them wisely through random sampling helps mitigate uncertainty and maintain data integrity.
Next: Dive deeper into sample size calculations and applications!
Imagine you’re on a quest to understand the preferences of dog owners in Los Angeles. You could try interviewing every single dog owner – a monumental task! Luckily, data analysis offers a secret weapon: sample size.
Why Sample Size Matters:
Think of the entire population of dog owners in LA as a vast land you want to explore. Analyzing every person (your population) would be ideal, but who has the time or resources? This is where the sample size comes in – your trusty mini-me of the population. It’s a smaller, manageable group chosen to represent the larger whole. By analyzing this mini-me, you can draw insights about the entire population.
Benefits of Sample Size:
- Efficiency: Saves time and resources compared to analyzing the entire population.
- Feasibility: Makes large-scale studies possible.
- Cost-effectiveness: Saves money compared to comprehensive data collection.
The Trade-off: Uncertainty:
While convenient, samples come with a slight hitch: uncertainty. Your mini-me might not perfectly reflect the larger population. Imagine drawing a random sample of just five dog owners – their favorite kibble may not represent the preferences of all LA dog owners.
Minimizing Uncertainty:
Here’s how we can make our trusty mini-me a more accurate reflection:
- Sample Size Calculation: Using statistical formulas, we can determine the minimum sample size needed for reliable results. Think of it as finding the perfect miniature size for your landscape model.
- Random Sampling: Choosing your mini-me members randomly ensures everyone has an equal chance of being included, reducing bias and making your sample more representative. Imagine blindly picking names from a hat – everyone gets a fair shot!
- Stratification: Dividing the population into subgroups (e.g., dog breeds) and then randomly sampling from each subgroup ensures all groups are represented proportionally. It’s like creating miniature versions of each neighborhood in your landscape model.
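The random and stratified approaches above can be sketched in a few lines of Python. Everything here is invented for illustration: a made-up population of 1,000 dog owners tagged with a breed group, sampled down to 100.

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Hypothetical population: 1,000 dog owners, each tagged with a breed group
population = [{"id": i, "breed": random.choice(["retriever", "terrier", "hound"])}
              for i in range(1000)]

# Simple random sample: every owner has an equal chance of selection
simple_sample = random.sample(population, 100)

# Stratified sample: split into subgroups, then sample proportionally from each
def stratified_sample(pop, key, n):
    groups = {}
    for item in pop:
        groups.setdefault(item[key], []).append(item)
    sample = []
    for members in groups.values():
        k = round(n * len(members) / len(pop))  # proportional allocation
        sample.extend(random.sample(members, k))
    return sample

strat = stratified_sample(population, "breed", 100)
print(len(simple_sample), len(strat))
```

Because of rounding in the proportional allocation, the stratified sample can be off by an owner or two from the target of 100, but every breed group is guaranteed representation.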
Mastering Sample Size:
By understanding the trade-off between efficiency and uncertainty, and by employing best practices like sample size calculation and random sampling, you can harness the power of samples to draw valuable insights from even the mightiest populations.
Remember: Sample size is a crucial tool in any data analyst’s arsenal. Use it wisely, and you’ll unlock the secrets of extracting accurate and actionable insights from the vast world of data.
Next Steps:
- Dive deeper into sample size calculations and explore different sampling techniques.
- Practice applying sample size concepts to real-world data analysis scenarios.
- Become a sample size master and conquer the uncertainty monster!
This tutorial provides a high-level overview of the importance of sample size in data analysis. It highlights the benefits, trade-offs, and strategies for minimizing uncertainty, encouraging viewers to delve deeper into the topic and hone their sampling skills.
What are some of the possible challenges associated with using 100% of a population in data analysis? Select all that apply.
- Using 100% of a population is expensive.
- Using 100% of a population is time-consuming.
Using 100% of a population is time-consuming and expensive.
Okay, earlier we
talked about having the right kind of data to
meet your business objective and the importance of having
the right amount of data to make sure your analysis is
as accurate as possible. You might remember that
for data analysts, a population is all
possible data values in a certain dataset. If you’re able to
use 100 percent of a population in your
analysis, that’s great. But sometimes collecting
information about an entire population
just isn’t possible. It’s too time-consuming
or expensive. For example, let’s say
a global organization wants to know more about
pet owners who have cats. You’re tasked with finding
out which kinds of toys cat owners in Canada prefer. But there are millions of
cat owners in Canada, so getting data from all of them would be a huge challenge. Fear not! Allow me to
introduce you to… sample size! When you use sample
size or a sample, you use a part of a population that’s representative
of the population. The goal is to get
enough information from a small group within a population to make predictions or conclusions about
the whole population. The sample size determines the degree
to which you can be confident that your conclusions accurately represent
the population. For the data on cat owners, a sample size might
contain data about hundreds or thousands of
people rather than millions. Using a sample for
analysis is more cost-effective and
takes less time. If done carefully
and thoughtfully, you can get the
same results using a sample size instead
of trying to hunt down every single cat owner to find out their
favorite cat toys. There is a potential
downside, though. When you only use a small
sample of a population, it can lead to uncertainty. You can’t really be 100 percent sure that your statistics are a complete and accurate
representation of the population. This can lead to sampling bias, which we covered
earlier in the program. Sampling bias is when a sample isn’t representative of
the population as a whole. This means some members
of the population are being overrepresented
or underrepresented. For example, if the survey
used to collect data from cat owners only included
people with smartphones, then cat owners who don’t have a smartphone wouldn’t be
represented in the data. Using random sampling can help address some of those
issues with sampling bias. Random sampling is a way of selecting a sample
from a population so that every member of the population has an equal
chance of being chosen. Going back to our
cat owners again, using a random sample of
cat owners means cat owners of every type have an equal
chance of being chosen. Cat owners who live in
apartments in Ontario would have the same chance of
being represented as those who live in
houses in Alberta. As a data analyst, you’ll find that creating
sample sizes usually takes place before you
even get to the data. But it’s still good
for you to know that the data you are
going to analyze is representative of the population and works with your objective. It’s also good to know what’s coming up in your data journey. In the next video, you’ll
have an option to become even more comfortable with sample
sizes. See you there.
Reading: Calculating sample size
Reading
Before you dig deeper into sample size, familiarize yourself with these terms and definitions:
Terminology | Definitions |
---|---|
Population | The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company. |
Sample | A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population. |
Margin of error | Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. |
Confidence level | How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. |
Confidence interval | The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error. |
Statistical significance | The determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance. |
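To make the margin of error and confidence interval terms concrete, here is a minimal Python sketch using the standard normal-approximation formula for a sample proportion (z = 1.96 for a 95% confidence level). The survey numbers are invented:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion at ~95% confidence (z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical survey of 1,000 people where 50% answered "yes"
# (p = 0.5 is the worst case for variability)
moe = margin_of_error(0.5, 1000)
lower, upper = 0.5 - moe, 0.5 + moe
print(f"margin of error: ±{moe:.1%}")                       # about ±3.1%
print(f"95% confidence interval: {lower:.1%} to {upper:.1%}")
```

Notice that quadrupling the sample size to 4,000 only halves the margin of error, which is why larger samples have diminishing returns.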
Things to remember when determining the size of your sample
When figuring out a sample size, here are things to keep in mind:
- Don’t use a sample size of less than 30. Based on the Central Limit Theorem, 30 is generally the smallest sample size for which the average result of a sample starts to represent the average result of a population.
- The confidence level most commonly used is 95%, but 90% can work in some cases.
Increase the sample size to meet specific needs of your project:
- For a higher confidence level, use a larger sample size
- To decrease the margin of error, use a larger sample size
- For greater statistical significance, use a larger sample size
Note: Sample size calculators use statistical formulas to determine a sample size. More about these are coming up in the course! Stay tuned.
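As a preview of the statistical formulas those calculators use, here is a simplified Python sketch of the common sample-size formula for a proportion (sometimes called Cochran's formula). It ignores the finite-population correction that real calculators often apply, and uses p = 0.5 as the conservative worst case:

```python
import math

# Standard z-scores for common confidence levels
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def sample_size(confidence, margin_of_error, p=0.5):
    """Minimum sample size for estimating a proportion (large population).
    p = 0.5 is the conservative, worst-case assumption about variability."""
    z = Z[confidence]
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

print(sample_size(0.95, 0.05))  # 385
print(sample_size(0.99, 0.05))  # higher confidence -> larger sample
print(sample_size(0.95, 0.03))  # smaller margin of error -> larger sample
```

This matches the rules of thumb above: raising the confidence level or shrinking the margin of error both push the required sample size up.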
Why a minimum sample of 30?
This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics. As sample size increases, the distribution of sample averages drawn from a large number of samples more closely resembles the normal (bell-shaped) distribution. A sample of 30 is the smallest sample size for which the CLT is still considered valid. Researchers who rely on regression analysis – statistical methods to determine the relationships between controlled and dependent variables – also prefer a minimum sample of 30.
Still curious? Without getting too much into the math, check out these articles:
- Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index.
- Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size.
Sample sizes vary by business problem
Sample size will vary based on the type of business problem you are trying to solve.
For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like?
Would 200 be alright if the people surveyed represented every district in the city?
Answer: It depends on the stakes.
- A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library
- A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library
You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey.
Larger sample sizes have a higher cost
You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results.
Knowing the basics is helpful
Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results.
Refer to the Determine the Best Sample Size video for a demonstration of a sample size calculator, or refer to the Sample Size Calculator reading for additional information.
Practice Quiz: Test your knowledge on insufficient data
What should an analyst do if they do not have the data needed to meet a business objective? Select all that apply.
- Perform the analysis by finding and using proxy data from other datasets.
- Gather related data on a small scale and request additional time to find more complete data.
If an analyst does not have the data needed to meet a business objective, they should gather related data on a small scale and request additional time. Then, they can find more complete data or perform the analysis by finding and using proxy data from other datasets.
Which of the following are limitations that might lead to insufficient data? Select all that apply.
- Data from a single source
- Outdated data
- Data that updates continually
Limitations that might lead to insufficient data include data that updates continually, outdated data, and data from a single source.
A data analyst wants to find out how many people in Utah have swimming pools. It’s unlikely that they can survey every Utah resident. Instead, they survey enough people to be representative of the population. This describes what data analytics concept?
Sample
This describes a sample, which is a part of a population that is representative of the whole.
Testing your data
Video: Using statistical power
Main points:
- Introduction: Statistical power as a “data superpower” for getting meaningful results from tests.
- Example: Testing a milkshake ad on a sample of customers to gauge its effectiveness.
- Larger sample size: Increases the chance of statistically significant results.
- Statistical power: Probability of achieving significant results, typically shown as a value out of 1 (e.g., 0.6 = 60%).
- Statistical significance: Means results are real and not due to random chance.
- Target power: 80% or higher for reliable conclusions.
- Second example: Testing a new milkshake flavor in limited locations, considering factors impacting power.
- Measurable effects: Increased sales or customer numbers in the test locations.
- Conclusion: Statistical power is a crucial tool for data analysts, even if it lacks the flashy appeal of flying.
Key takeaways:
- Understand the concept of statistical power and its importance in testing.
- Recognize the connection between sample size and power.
- Interpret statistical power values and their implications for results.
- Be aware of factors influencing power and consider them when designing tests.
This summary captures the key ideas of the video, providing a concise overview of this essential data analysis concept. It also maintains a lighthearted tone with the milkshake examples, making the topic more engaging and relatable.
Introduction:
- What is statistical power? The probability of detecting a true effect or difference in a study, if it actually exists.
- Why is it important? Helps ensure that your results are reliable and not due to chance, aids in designing studies with adequate sample sizes.
Key Concepts:
- Significance level (alpha): The threshold for deciding whether a result is statistically significant (typically 0.05).
- Effect size: The magnitude of the difference or relationship you’re studying (small, medium, or large).
- Sample size: The number of participants or observations in your study.
- Power calculations: Used to determine the appropriate sample size to achieve a desired level of power (typically 80% or higher).
Steps in Using Statistical Power:
- Define your research question and hypothesis.
- Determine the appropriate significance level (alpha).
- Estimate the expected effect size (based on previous research or pilot studies).
- Conduct a power analysis to calculate the required sample size.
- Collect data and conduct your analysis.
- Interpret your results in the context of statistical power.
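Steps 3 and 4 above can be sketched with the standard normal-approximation formula for comparing two group means. This is a simplified illustration rather than a full power analysis: the z-values used are the usual constants for a two-sided alpha of 0.05 (1.96) and 80% power (0.842), and the effect sizes are Cohen's conventional "medium" and "small" values:

```python
import math

def sample_size_per_group(effect_size, power_z=0.842, alpha_z=1.96):
    """Approximate n per group for a two-sample comparison of means.
    effect_size = (difference in means) / standard deviation (Cohen's d).
    Defaults: alpha = 0.05 two-sided (z = 1.96), power = 80% (z = 0.842)."""
    return math.ceil(2 * (alpha_z + power_z) ** 2 / effect_size ** 2)

# A "medium" effect (d = 0.5) needs far fewer observations than a "small" one (d = 0.2)
print(sample_size_per_group(0.5))  # 63 per group
print(sample_size_per_group(0.2))  # 393 per group
```

This illustrates the factors listed below: a smaller effect size or a stricter alpha drives the required sample size up, which is exactly why power analysis belongs in study design rather than after the fact.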
Factors Affecting Statistical Power:
- Sample size: Increasing sample size increases power.
- Effect size: Larger effect sizes are easier to detect, requiring smaller sample sizes.
- Significance level: More stringent alpha levels (e.g., 0.01 instead of 0.05) decrease power.
- Variability in the data: Higher variability reduces power.
Tools for Power Analysis:
- Statistical software packages: SPSS, SAS, R, G*Power.
- Online power calculators: Available from various sources.
Best Practices:
- Plan for power: Incorporate power analysis into study design early.
- Aim for high power: Target 80% or higher to minimize false negatives.
- Consider practical constraints: Balance power with resources and feasibility.
- Report power calculations: Include in research publications for transparency.
Additional Considerations:
- Ethical implications: Avoid unnecessarily large sample sizes that place burdens on participants.
- Alternative approaches: Consider Bayesian methods that incorporate prior information and aren’t as reliant on sample size.
Remember: Statistical power is an essential tool for robust and reliable data analysis. By understanding and applying it correctly, you can enhance the validity and impact of your research findings.
Hey, there. We’ve all probably dreamed of having
a superpower at least once in our lives. I know I have. I’d love to be able to fly. But there’s another superpower you might
not have heard of: statistical power. Statistical power is the probability of
getting meaningful results from a test. I’m guessing that’s not a superpower
any of you have dreamed about. Still, it’s a pretty
great data superpower. For data analysts, your projects
might begin with the test or study. Hypothesis testing is a way
to see if a survey or experiment has meaningful results. Here’s an example. Let’s say you work for a restaurant chain
that’s planning a marketing campaign for their new milkshakes. You need to test the ad on a group
of customers before turning it into a nationwide ad campaign. In the test, you want to check whether
customers like or dislike the campaign. You also want to rule out any factors
outside of the ad that might lead them to say they don’t like it. Using all your customers would be
too time consuming and expensive. So, you’ll need to figure out how many
customers you’ll need to show that the ad is effective. Fifty probably wouldn’t be enough. Even if you randomly chose 50 customers, you might end up with customers
who don’t like milkshakes at all. And if that happens, you won’t be able to
measure the effectiveness of your ad in getting more milkshake orders since no
one in the sample size would order them. That’s why you need a larger sample size: so you can make sure you get a good number
of all types of people for your test. Usually, the larger the sample size,
the greater the chance you’ll have statistically significant
results with your test. And that’s statistical power. In this case, using as many customers as
possible will show the actual differences between the groups who like or dislike the ad versus people whose
decision wasn’t based on the ad at all. There are ways to accurately
calculate statistical power, but we won’t go into them here. You might need to calculate it
on your own as a data analyst. For now, you should know that statistical
power is usually shown as a value out of one. So if your statistical power is 0.6,
that’s the same thing as saying 60%. In the milkshake ad test, if you found a statistical power of 60%,
that means there’s a 60% chance of you getting a statistically significant
result on the ad’s effectiveness. “Statistically significant”
is a term that is used in statistics. If you want to learn more about the
technical meaning, you can search online. But in basic terms,
if a test is statistically significant, it means the results of the test are real
and not an error caused by random chance. So there’s a 60% chance that
the results of the milkshake ad test are reliable and real and a 40% chance that the result
of the test is wrong. Usually, you need a statistical
power of at least 0.8 or 80% to consider your results
statistically significant. Let’s check out one more scenario. We’ll stick with milkshakes because,
well, because I like milkshakes. Imagine you work for a restaurant chain
that wants to launch a brand-new birthday cake flavored milkshake. This milkshake will be more expensive
to produce than your other milkshakes. Your company hopes that the buzz around
the new flavor will bring in more customers and money to offset this cost. They want to test this out in
a few restaurant locations first. So let’s figure out how many locations
you’d have to use to be confident in your results. First, you’d have to think about
what might prevent you from getting statistically significant results. Are there restaurants running any
other promotions that might bring in new customers? Do some restaurants have customers
that always buy the newest item, no matter what it is? Do some location have construction
that recently started, that would prevent customers from
even going to the restaurant? To get a higher statistical power, you’d
have to consider all of these factors before you decide how many locations
to include in your sample size for your study. You want to make sure any effect is most
likely due to the new milkshake flavor, not another factor. The measurable effects would
be an increase in sales or the number of customers at the
locations in your sample size. That’s it for now. Coming up, we’ll explore sample
sizes in more detail, so you can get a better idea of how
they impact your tests and studies. In the meantime, you’ve gotten to know
a little bit more about milkshakes and superpowers. And of course, statistical power. Sadly, only statistical power can
truly be useful for data analysts. Though putting on my cape and flying to grab a milkshake right
now does sound pretty good.
Reading: What to do when there is no data
Reading
Earlier, you learned how you can still do an analysis using proxy data if you have no data. You might have some questions about proxy data, so this reading will give you a few more examples of the types of datasets that can serve as alternate data sources.
Proxy data examples
Sometimes the data to support a business objective isn’t readily available. This is when proxy data is useful. Take a look at the following scenarios and where proxy data comes in for each example:
Business scenario | How proxy data can be used |
---|---|
A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now. | The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership. |
A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years. | The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years. |
The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren’t publicly available yet. | The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier. |
Open (public) datasets
If you are part of a large organization, you might have access to lots of sources of data. But if you are looking for something specific or a little outside your line of business, you can also make use of open or public datasets. (You can refer to this Medium article for a brief explanation of the difference between open and public data.)
Here’s an example. A nasal version of a vaccine was recently made available. A clinic wants to know what to expect for contraindications, but just started collecting first-party data from its patients. A contraindication is a condition that may cause a patient not to take a vaccine due to the harm it would cause them if taken. To estimate the number of possible contraindications, a data analyst proxies an open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data with patient profiles most closely matching the makeup of the patients at the clinic.
There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com), which we previously introduced, has datasets in a variety of formats, including the most basic type, comma-separated values (CSV) files.
CSV, JSON, SQLite, and BigQuery datasets
- CSV: Check out this Credit card customers dataset, which has information from 10,000 customers including age, salary, marital status, credit card limit, credit card category, etc. (CC0: Public Domain, Sakshi Goyal).
- JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell J).
- SQLite: Check out this SQLite dataset for 24 years worth of U.S. wildfire data (CC0: Public Domain, Rachael Tatman).
- BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise Store (CC0 Public Domain, Google BigQuery).
Refer to the Kaggle documentation for datasets for more information and search for and explore datasets on your own at kaggle.com/datasets.
As with all other kinds of datasets, be on the lookout for duplicate data and ‘Null’ in open datasets. Null most often means that a data field was unassigned (left empty), but sometimes Null can be interpreted as the value, 0. It is important to understand how Null was used before you start analyzing a dataset with Null data.
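The difference between Null (unassigned) and a true 0 is easy to see in a small Python sketch, where `None` stands in for Null and the ratings data is made up:

```python
# Hypothetical customer ratings where None marks an unassigned field,
# which is not the same thing as a real rating of 0
ratings = [5, 0, None, 4, None, 3, 0]

nulls = sum(1 for r in ratings if r is None)
zeros = sum(1 for r in ratings if r == 0)

# Averaging only the assigned values vs. wrongly treating None as 0
assigned = [r for r in ratings if r is not None]
avg_assigned = sum(assigned) / len(assigned)                          # 12 / 5 = 2.4
avg_if_none_were_zero = sum(r or 0 for r in ratings) / len(ratings)   # 12 / 7 ≈ 1.71

print(nulls, zeros, avg_assigned, round(avg_if_none_were_zero, 2))
```

Misreading the two unassigned fields as zeros drags the average down noticeably, which is exactly the kind of silent distortion the note above warns about.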
Video: Determine the best sample size
Main points:
- Sample size: A representative subset of a larger population used for studies.
- Benefits: Reduces cost and time compared to analyzing the entire population.
- Confidence level: Probability that the sample accurately reflects the larger population (99% is ideal; most industries aim for at least 90–95%).
- Margin of error: How close the sample results are likely to be to the population results (smaller is better).
- Sample size calculator: Online tools help determine appropriate sample size based on population, confidence level, and margin of error.
- Example: Calculating sample size for a student candy preference study with 500 students, 95% confidence level, and 5% margin of error – result: 218 students.
- Key takeaway: Understanding sample size, confidence level, and margin of error is crucial for accurate data analysis.
Additional notes:
- Emphasizes the importance of data integrity for valid results.
- References sample size as a “data superpower” similar to the previous video on statistical power.
- Encourages viewers to practice with online calculators for practical application.
Introduction:
- What is sample size? The number of participants or observations included in a study.
- Importance:
- Crucial for accurate and reliable results.
- Too small: May miss important effects or relationships.
- Too large: Wastes resources and time.
Key Factors to Consider:
- Confidence Level: The desired probability that your sample results accurately reflect the population (e.g., 95%, 99%).
- Margin of Error: The acceptable amount of difference between your sample results and the true population values (e.g., 5%, 3%).
- Population Size: The total number of individuals or units in the population you’re studying.
- Variability: The degree to which data points differ from each other. Higher variability requires larger samples.
- Effect Size: The magnitude of the difference or relationship you’re trying to detect. Larger effect sizes require smaller samples.
- Statistical Test: Different tests have different sample size requirements.
- Practical Constraints: Time, cost, and feasibility of data collection.
Steps to Determine Sample Size:
- Define Your Research Question and Hypothesis: Clearly articulate what you’re trying to investigate.
- Specify Confidence Level and Margin of Error: Choose levels that align with your research goals and tolerance for uncertainty.
- Estimate Population Variability: Use prior research, pilot studies, or expert knowledge to estimate variability.
- Consider Effect Size: If possible, estimate the expected effect size to refine sample size calculations.
- Choose a Statistical Test: Select the appropriate test based on your research question and data type.
- Use a Sample Size Calculator: Online calculators or statistical software can perform calculations based on input parameters.
- Consult with a Statistician: For complex studies or specific needs, seek guidance from a statistician.
Additional Considerations:
- Ethical Issues: Balance statistical needs with potential burdens on participants.
- Non-Response: Anticipate potential non-response and adjust sample size accordingly.
- Subgroup Analysis: If planning to analyze subgroups, ensure adequate sample sizes within each group.
- Power Analysis: Conduct a power analysis to determine the probability of detecting a true effect, given your sample size and other study parameters.
Best Practices:
- Plan Early: Incorporate sample size determination into the early stages of study design.
- Consider Multiple Factors: Base sample size decisions on a comprehensive assessment of factors.
- Document Decisions: Clearly report sample size calculations and rationale in research publications.
- Seek Expert Advice: Consult statisticians or other experts for guidance, especially for complex studies.
Remember: Determining the best sample size is a critical step in ensuring valid and meaningful results in data analysis. By carefully considering the factors involved and following best practices, you can make informed decisions that lead to accurate and reliable conclusions.
Great to see you again. In this video, we’ll
go into more detail about sample sizes
and data integrity. If you’ve ever been to a
store that hands out samples, you know it’s one of
life’s little pleasures. For me, anyway! Those small samples are
also a very smart way for businesses to learn more
about their products from customers without having to
give everyone a free sample. A lot of organizations use
sample size in a similar way. They take one part
of something larger. In this case, a sample
of a population. Sometimes they’ll
perform complex tests on their data to see if it meets
their business objectives. We won’t go into all
the calculations needed to do this effectively. Instead, we’ll focus
on a “big picture” look at the process
and what it involves. As a quick reminder, sample size is a part of a population that is
representative of the population. For businesses, it’s a
very important tool. It can be both expensive and time-consuming to analyze an
entire population of data. Using sample size usually makes the most sense and can still lead to valid and
useful findings. There are handy calculators online that can help
you find sample size. You need to input the
confidence level, population size, and
margin of error. We’ve talked about
population size before. To build on that,
we’ll learn about confidence level and
margin of error. Knowing about these
concepts will help you understand why you need them
to calculate sample size. The confidence level is
the probability that your sample accurately reflects
the greater population. You can think of it the same way as confidence
in anything else. It’s how strongly
you feel that you can rely on something or someone. Having a 99 percent
confidence level is ideal. But most industries
hope for at least a 90 or 95 percent
confidence level. Industries like pharmaceuticals usually want a confidence level that’s as high as possible when they are using a sample size. This makes sense because they’re testing
medicines and need to be sure they work and are
safe for everyone to use. For other studies, organizations might
just need to know that the test or survey results have them heading in
the right direction. For example, if a paint company is testing out new colors, a lower confidence level is okay. You also want to consider the margin of error
for your study. You’ll learn more
about this soon, but it basically tells you how close your sample size
results are to what your results would be if you use the entire population that
your sample size represents. Think of it like this. Let’s say that the principal of a middle school approaches you with a study about
students’ candy preferences. They need to know an appropriate sample size, and they need it now. The school has a student
population of 500, and they’re asking for
a confidence level of 95 percent and a margin
of error of 5 percent. We’ve set up a calculator
in a spreadsheet, but you can also easily find
this type of calculator by searching “sample size
calculator” on the internet. Just like those calculators, our spreadsheet calculator
doesn’t show any of the more complex calculations for figuring out sample size. All we need to do is input the numbers for our population, confidence level,
and margin of error. And when we type 500 for
our population size, 95 for our confidence
level percentage, 5 for our margin
of error percentage, the result is about 218. That means for this study, an appropriate sample
size would be 218. If we surveyed 218 students and found that 55 percent of
them preferred chocolate, then we could be
pretty confident that would be true of
all 500 students. 218 is the minimum number of people we need
to survey based on our criteria of a 95
percent confidence level and a 5 percent margin of error. In case you’re wondering, the confidence level
and margin of error don’t have to add
up to 100 percent. They’re independent
of each other. So let’s say we change
our margin of error from 5 percent
to 3 percent. Then we find that our sample size would need to be larger, about 341 instead of 218, to make the results of the study more representative
of the population. Feel free to practice with
an online calculator. Knowing sample size and how to find it will help you
when you work with data. We’ve got more useful
knowledge coming your way, including learning about
margin of error. See you soon!
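The spreadsheet calculator in the video hides the underlying math, but the numbers it produces can be reproduced with the standard formula most online sample size calculators use. This is an assumption on my part — the course never shows the formula — but the sketch below applies Cochran’s formula with a finite population correction and lands on the same 218 and 341 results from the candy-preference example.

```python
import math

# Approximate z-scores for common confidence levels.
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}

def sample_size(population, confidence_pct, margin_of_error_pct, p=0.5):
    """Minimum sample size via Cochran's formula with a finite
    population correction. p=0.5 is the most conservative
    assumption about how varied responses will be."""
    z = Z_SCORES[confidence_pct]
    e = margin_of_error_pct / 100
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)     # correct for a finite population
    return math.ceil(n)

print(sample_size(500, 95, 5))  # → 218, matching the video's result
print(sample_size(500, 95, 3))  # → 341, the tighter-margin result
```

Note how shrinking the margin of error from 5% to 3% pushes the required sample from 218 to 341 — the same trade-off the video describes.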
Reading: Sample size calculator
Reading
In this reading, you will learn the basics of sample size calculators, how to use them, and how to understand the results. A sample size calculator tells you how many people you need to interview (or things you need to test) to get results that represent the target population. Let’s review some terms you will come across when using a sample size calculator:
- Confidence level: The probability that your sample size accurately reflects the greater population.
- Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population.
- Population: This is the total number you hope to pull your sample from.
- Sample: A part of a population that is representative of the population.
- Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.
How to use a sample size calculator
In order to use a sample size calculator, you need to have the population size, confidence level, and the acceptable margin of error already decided so you can input them into the tool. If this information is ready to go, check out these sample size calculators below:
What to do with the results
After you have plugged your information into one of these calculators, it will give you a recommended sample size. Keep in mind, the calculated sample size is the minimum number to achieve what you input for confidence level and margin of error. If you are working with a survey, you will also need to think about the estimated response rate to figure out how many surveys you will need to send out. For example, if you need a sample size of 100 individuals and your estimated response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses you need for your analysis.
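The response-rate adjustment in the example above is a simple division, rounded up. A minimal sketch (the function name is my own, not from the course):

```python
import math

def surveys_to_send(needed_responses, response_rate):
    """Surveys to distribute so that, at the estimated response
    rate, you still collect the sample size you need."""
    return math.ceil(needed_responses / response_rate)

print(surveys_to_send(100, 0.10))  # → 1000, as in the reading's example
```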
Now that you have the basics, try some calculations using the sample size calculators and refer back to this reading if you need a refresher on the definitions.
Practice Quiz: Test your knowledge on testing your data
A research team runs an experiment to determine if a new security system is more effective than the previous version. What type of results are required for the experiment to be statistically significant?
Results that are real and not caused by random chance
In order for an experiment to be statistically significant, the results should be real and not caused by random chance.
In order to have a high confidence level in a customer survey, what should the sample size accurately reflect?
The entire population
In order to have a high confidence level in a customer survey, the sample size should accurately reflect the entire population.
A data analyst determines an appropriate sample size for a survey. They can check their work by making sure the confidence level percentage plus the margin of error percentage add up to 100%.
False
The confidence level percentage and margin of error percentage do not have to add up to 100%. They are independent of each other.
Consider the margin of error
Video: Evaluate the reliability of your data
Key points:
- Definition: Margin of error is the maximum expected difference between sample results and the actual population.
- Importance: Helps understand the reliability of sample-based data and its relevance to the entire population.
- Calculation: Based on population size, sample size, and confidence level (e.g., 95%). Higher confidence means a lower margin of error.
- Example: Survey about a 4-day workweek with 60% approval and 10% margin of error suggests actual population support between 50% and 70%.
- Interpretation: Smaller margin of error (e.g., 5%) indicates more reliable results and confidence in generalizing to the population.
- Resources: Online calculators and spreadsheets can help calculate margin of error based on your data.
Overall:
Understanding margin of error is crucial for data analysts to assess the accuracy and generalizability of their results. By calculating and interpreting the margin of error, they can make more informed conclusions and avoid misleading implications based on small-scale samples.
Additional notes:
- The video emphasizes the connection between sample size, confidence level, and margin of error, showing how changing one affects the others.
- Data integrity and alignment with objectives are highlighted as essential elements for reliable analysis.
Introduction
In data analysis, the reliability of your data is paramount to producing accurate and trustworthy insights. It’s essential to assess the quality and trustworthiness of your data before embarking on any analysis to ensure your findings are valid and meaningful. This tutorial will guide you through the key steps and considerations involved in evaluating data reliability.
Key Considerations
1. Data Source and Collection Methods:
- Credibility: Assess the reputation and expertise of the data source.
- Transparency: Understand the data collection methods and any potential biases or limitations.
- Consistency: Ensure data collection procedures were consistent across time and participants.
2. Data Completeness and Accuracy:
- Missing Values: Identify and address missing data through imputation or exclusion, depending on the analysis.
- Errors: Check for inconsistencies, outliers, or errors in the data, and rectify them as needed.
- Validation: Compare data against external sources or expert knowledge to verify its accuracy.
3. Sampling Methods:
- Representation: Evaluate whether the sample is representative of the population you’re studying.
- Randomness: Ensure the sample was selected randomly to avoid selection bias.
- Size: Consider the sample size and its impact on statistical power and margin of error.
4. Data Integrity:
- Security: Verify that data has been protected from unauthorized access or tampering.
- Consistency: Ensure data formatting and definitions are consistent across different sources.
- Documentation: Review any available documentation about data collection, cleaning, and storage procedures.
5. Statistical Measures:
- Margin of Error: Calculate the margin of error to understand the potential range of values in the population.
- Confidence Intervals: Use confidence intervals to express the uncertainty associated with estimates.
- Statistical Tests: Apply appropriate statistical tests to assess the significance of findings and relationships within the data.
Additional Considerations:
- Data Bias: Be mindful of potential biases in data collection or analysis, such as selection bias, measurement bias, or confirmation bias.
- Data Consistency: Ensure data aligns with your research objectives and questions.
- External Validation: Consider validating findings with external data sources or expert opinions.
Conclusion
Evaluating data reliability is an essential step in any data analysis project. By carefully considering these factors, you can build confidence in your data and ensure the validity of your conclusions, leading to more informed decision-making and meaningful insights.
Hey there! Earlier, we touched on margin of
error without explaining it completely. Well, we’re going to right that wrong
in this video by explaining margin of error more. We’ll even include an example
of how to calculate it. As a data analyst, it’s important for you
to figure out sample size and variables like confidence level and margin of error
before running any kind of test or survey. It’s the best way to make sure
your results are objective, and it gives you a better chance of getting
statistically significant results. But if you already know the sample size,
like when you’re given survey results to analyze, you can calculate
the margin of error yourself. Then you’ll have a better idea of how much of
a difference there is between your sample and your population. We’ll start at the beginning
with a more complete definition. Margin of error is the maximum that
the sample results are expected to differ from those of
the actual population. Let’s think about an example
of margin of error. It would be great to survey or
test an entire population, but it’s usually impossible or
impractical to do this. So instead, we take a sample
of the larger population. Based on the sample size, the resulting margin of error will tell
us how different the results might be compared to the results if we had
surveyed the entire population. Margin of error helps you understand how
reliable the data from your hypothesis testing is. The closer to zero the margin of error,
the closer your results from your sample would match results
from the overall population. For example, let’s say you completed a nationwide
survey using a sample of the population. You asked people who work five-day
workweeks whether they like the idea of a four-day workweek. So your survey tells you that
60% prefer a four-day workweek. The margin of error was 10%, which tells us that between
50 and 70% like the idea. So if we were to survey all
five-day workers nationwide, between 50 and 70% would
agree with our results. Keep in mind that our range
is between 50 and 70%. That’s because the margin of error
is counted in both directions from the survey results of 60%. If you set up a 95% confidence
level for your survey, there’ll be a 95% chance that the
entire population’s responses will fall between 50 and 70% saying, yes,
they want a four-day workweek. Since your margin of error overlaps
with that 50% mark, you can’t say for sure that the public likes
the idea of a four-day workweek. In that case, you’d have to say
your survey was inconclusive. Now, if you wanted a lower
margin of error, say 5%, with a range between 55 and 65%,
you could increase the sample size. But if you’ve already been
given the sample size, you can calculate the margin
of error yourself. Then you can decide yourself how
much of a chance your results have of being statistically significant
based on your margin of error. In general, the more people
you include in your survey, the more likely your sample is
representative of the entire population. Decreasing the confidence level
would also have the same effect, but that would also make it less
likely that your survey is accurate. So to calculate margin of
error, you need three things: population size, sample size,
and confidence level. And just like with sample size, you can find lots of calculators online by
searching “margin of error calculator.” But we’ll show you in a spreadsheet, just like we did when we
calculated sample size. Lets say you’re running a study on
the effectiveness of a new drug. You have a sample size
of 500 participants whose condition affects 1%
of the world’s population. That’s about 80 million people,
which is the population for your study. Since it’s a drug study, you need
to have a confidence level of 99%. You also need a low margin of error. Let’s calculate it. We’ll put the numbers for population, confidence level, and sample size in the appropriate
spreadsheet cells. And our result is a margin of error
of close to 6%, plus or minus. When the drug study is complete, you’d
apply the margin of error to your results to determine how reliable
your results might be. Calculators like this one in the
spreadsheet are just one of the many tools you can use to ensure data integrity. And it’s also good to remember that
checking for data integrity and aligning the data with your objectives will put you
in good shape to complete your analysis. Knowing about sample size,
statistical power, margin of error, and other topics we’ve covered
will help your analysis run smoothly. That’s a lot of new concepts to take in. If you’d like to review them at any time, you can find them all in the glossary,
or feel free to rewatch the video! Soon you’ll explore the ins and
outs of clean data. The data adventure keeps moving! I’m so glad you’re moving along with it. You got this!
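As with sample size, the video’s spreadsheet hides the margin-of-error math. Assuming the standard formula that typical online calculators use (a z-score times the standard error, with a finite population correction), the sketch below reproduces the drug-study result of “close to 6%”:

```python
import math

# Approximate z-scores for common confidence levels.
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}

def margin_of_error(population, sample, confidence_pct, p=0.5):
    """Margin of error as a percentage, with a finite population
    correction; p=0.5 is the conservative default proportion."""
    z = Z_SCORES[confidence_pct]
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * math.sqrt(p * (1 - p) / sample) * fpc * 100

# Drug study: 500 participants from a population of ~80 million,
# at a 99% confidence level.
moe = margin_of_error(80_000_000, 500, 99)
print(round(moe, 1))  # → 5.8, i.e. "close to 6%" as in the video
```

With a population this large, the finite population correction barely matters — sample size and confidence level drive the result, which is why increasing the sample (or lowering the confidence level) shrinks the margin of error, just as the transcript explains.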
Reading: All about margin of error
Reading
Margin of error is the maximum amount that the sample results are expected to differ from those of the actual population. More technically, the margin of error defines a range of values below and above the average result for the sample. The average result for the entire population is expected to be within that range. We can better understand margin of error by using some examples below.
Margin of error in baseball
Imagine you are playing baseball and that you are up at bat. The crowd is roaring, and you are getting ready to try to hit the ball. The pitcher delivers a fastball traveling about 90-95mph, which takes about 400 milliseconds (ms) to reach the catcher’s glove. You swing and miss the first pitch because your timing was a little off. You wonder if you should have swung slightly earlier or slightly later to hit a home run. That time difference can be considered the margin of error, and it tells us how close or far your timing was from the average home run swing.
Margin of error in marketing
The margin of error is also important in marketing. Let’s use A/B testing as an example. A/B testing (or split testing) tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue. User traffic that gets monetized is known as the conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data behind what is working and what isn’t working. Marketers use the confidence interval (determined by the conversion rate and the margin of error) to understand the results.
For example, suppose you are conducting an A/B test to compare the effectiveness of two different email subject lines to entice people to open the email. You find that subject line A: “Special offer just for you” resulted in a 5% open rate compared to subject line B: “Don’t miss this opportunity” at 3%.
Does that mean subject line A is better than subject line B? It depends on your margin of error. If the margin of error was 2%, then subject line A’s actual open rate or confidence interval is somewhere between 3% and 7%. Since the lower end of the interval overlaps with subject line B’s results at 3%, you can’t conclude that there is a statistically significant difference between subject line A and B. Examining the margin of error is important when making conclusions based on your test results.
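The overlap check described above can be sketched in a few lines. This is just an illustration of the interval comparison, not a full significance test (the function name and shape are my own):

```python
def intervals_overlap(rate_a, moe_a, rate_b, moe_b=0.0):
    """True if the two confidence intervals overlap, meaning we
    can't call the difference statistically significant."""
    low_a, high_a = rate_a - moe_a, rate_a + moe_a
    low_b, high_b = rate_b - moe_b, rate_b + moe_b
    return low_a <= high_b and low_b <= high_a

# Subject line A: 5% open rate with a 2% margin → [3%, 7%].
# Subject line B observed at 3%.
print(intervals_overlap(5, 2, 3))  # → True: can't conclude A beats B
```

With a 1% margin of error instead, A’s interval would be [4%, 6%], which no longer touches B’s 3% — and the difference would look significant.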
Want to calculate your margin of error?
All you need is population size, confidence level, and sample size. In order to better understand this calculator, review these terms:
- Confidence level: A percentage indicating how likely your sample accurately reflects the greater population
- Population: The total number you pull your sample from
- Sample: A part of a population that is representative of the population
- Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population
In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such as the pharmaceutical industry.
After you have settled on your population size, sample size, and confidence level, plug the information into a margin of error calculator like the ones below:
- Margin of error calculator by Good Calculators (free online calculators)
- Margin of error calculator by CheckMarket
Key takeaway
Margin of error is used to determine how close your sample’s result is to what the result would likely have been if you could have surveyed or tested the entire population. Margin of error helps you understand and interpret survey or test results in real-life. Calculating the margin of error is particularly helpful when you are given the data to analyze. After using a calculator to calculate the margin of error, you will know how much the sample results might differ from the results of the entire population.
Practice Quiz: Test your knowledge on margin of error
Fill in the blank: Margin of error is the _____ amount that the sample results are expected to differ from those of the actual population.
maximum
Margin of error is the maximum amount that the sample results are expected to differ from those of the actual population.
In a survey about a new cleaning product, 75% of respondents report they would buy the product again. The margin of error for the survey is 5%. Based on the margin of error, what percentage range reflects the population’s true response?
Between 70% and 80%
Based on the margin of error, between 70% and 80% accurately reflects the population’s true response.
Module 1 challenge
Reading: Glossary: Terms and definitions
Quiz: Module 1 challenge
Fill in the blank: If a data analyst is using data that has been _____, the data will lack integrity and the analysis will be faulty.
compromised
If a data analyst is using data that has been compromised, the data will lack integrity and the analysis will be faulty.
A healthcare company keeps copies of their data at several locations across the country. The data becomes compromised because each location creates a copy of the original at different times of day. Which of the following processes caused the compromise?
Data replication
Data replication caused the compromise. Data replication is the process of storing data in multiple locations. If not done properly, replication can compromise integrity and cause inconsistencies.
A data analyst is given a dataset for analysis. It includes data about the total population of every country in the previous 20 years. Based on the available data, an analyst would be able to determine the reasons behind a certain country’s population increase from 2016 to 2017.
False
Based on the available data, the analyst would need more data to determine the reasons behind the population increase.
A data analyst is given a dataset for analysis. To use the template for this dataset, click the link below and select “Use Template.”
Link to template: June 2014 Invoices
Which of the following has duplicate data?
Data for Valando on 2/18/2014
Valando on 2/18/2014 contains duplicate data because the spreadsheet contains the same data in two different rows.
A data analyst is working on a project about the global supply chain. They have a dataset with lots of relevant data from Europe and Asia. However, they decide to generate new data that represents all continents. What type of insufficient data does this scenario describe?
Data that’s geographically limited
This example describes data that is insufficient because it’s geographically limited. If the analytics project has a global focus, the dataset should also be global.
When gathering data through a survey, companies can save money by surveying 100% of a population.
False
Using 100% of a population is ideal, but it can be very expensive to gather data from an entire population.
A restaurant wants to gather data about a new dish by giving out free samples and asking for feedback. Who should the restaurant give samples to?
All diners
Which of the following processes helps ensure a close alignment of data and business objectives?
Maintaining data integrity
Maintaining data integrity helps ensure a close alignment of data and business objectives because the data is likely to be accurate, complete, consistent, and trustworthy.