When data analysts work with data, they check that it is unbiased and credible. In this part of the course, you’ll learn how to identify different types of bias in data and how to ensure credibility in your data. You’ll also explore open data, the importance of data ethics and data privacy, and the relationship between the two.
Learning Objectives
- Explain what is involved in reviewing data to identify bias
- Discuss the difference between biased and unbiased data
- Identify different types of bias including confirmation, interpretation, and observer bias
- Discuss characteristics of credible sources of data including reference to untidy data
- Explain the concept of open data with reference to the ongoing debate in data analytics
- Define data ethics and data privacy
- Explain the relationship between data ethics and data privacy
- Demonstrate an understanding of the benefits of anonymizing data
- Demonstrate an awareness of the accessibility issues associated with open data
Unbiased and objective data
Video: Ensuring data integrity
This course will teach you how to analyze data for bias and credibility, identify good and bad data sources, and understand data ethics, privacy, and access.
Why is this important?
- Even the most sound data can be skewed or misinterpreted.
- Bad data can lead to bad decisions.
- As technology advances, new ethical questions arise about the use and collection of data.
As a data analyst, it is important to be able to identify and address these challenges in order to tell a meaningful story with your data.
In the next video, you will learn more about the first chapter of this data story: analyzing data for bias and credibility.
How to ensure data integrity in data analysis
Data integrity is the process of ensuring that data is accurate, complete, consistent, and reliable. It is important to ensure data integrity in data analysis because it can have a significant impact on the results of your analysis and the conclusions you draw.
Here are some tips for ensuring data integrity in data analysis:
- Identify and address potential sources of bias. Bias can be introduced into data at any stage of the data collection, processing, and analysis process. It is important to be aware of potential sources of bias and to take steps to mitigate them. For example, if you are collecting data through surveys, make sure that your survey is well-designed and that it does not bias the results.
- Validate your data. Once you have collected your data, it is important to validate it to ensure that it is accurate and complete. This may involve checking for typos, inconsistencies, and missing values. You may also want to use statistical methods to identify outliers, then decide case by case whether each one is an error to correct or remove, or a genuine value to keep.
- Use reliable data sources. When using data from other sources, it is important to make sure that the data is reliable. This means checking to see if the data has been collected and processed using sound methods. You may also want to look for independent verification of the data.
- Document your data collection and analysis process. Documenting your data collection and analysis process will help you to identify and address any potential errors or biases. It will also make it easier for others to understand your work and to reproduce your results.
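The validation step in the tips above can be sketched in a few lines of Python. This is a minimal illustration, not a complete data quality pipeline: the `records` list, its field names, and the helper functions are all hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
from statistics import quantiles

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 29},
    {"id": 3, "age": None},   # missing value
    {"id": 4, "age": 41},
    {"id": 4, "age": 41},     # duplicated id
    {"id": 5, "age": 38},
    {"id": 6, "age": 420},    # probably a typo for 42
    {"id": 7, "age": 31},
    {"id": 8, "age": 45},
]

def find_missing(rows, field):
    """Return the ids of rows where the field has no value."""
    return [r["id"] for r in rows if r[field] is None]

def find_duplicate_ids(rows):
    """Return ids that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        if r["id"] in seen:
            dupes.add(r["id"])
        seen.add(r["id"])
    return sorted(dupes)

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [r["age"] for r in records if r["age"] is not None]
missing = find_missing(records, "age")
duplicates = find_duplicate_ids(records)
outliers = flag_outliers(ages)

print("missing age for ids:", missing)
print("duplicated ids:", duplicates)
print("flagged ages:", outliers)
```

The functions only flag suspicious rows. Whether a flagged value is corrected, removed, or kept is a judgment call that should be documented along with the rest of the analysis.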
Here are some additional tips:
- Use strong data governance policies and procedures. This includes having clear roles and responsibilities for data management, as well as processes for data access and security.
- Implement data quality checks at key points in the data processing and analysis pipeline. This will help you to identify and address data quality issues early on.
- Use data visualization tools to explore your data and identify any anomalies.
- Have your work reviewed by a colleague or other expert. This can help to identify any potential errors or biases in your analysis.
By following these tips, you can help to ensure that your data is accurate, complete, consistent, and reliable. This will lead to more accurate and reliable results from your data analysis.
Welcome back. In an earlier course, we talked about how to prepare data in a way that helps you tell a meaningful story. Now let’s find out what comes next. Like all good tales, your data story will be filled with characters, questions, challenges, conflict, and hopefully a resolution. The trick is to avoid the conflict, overcome the challenges and answer the questions. That’s what this course is all about. Here’s how we’ll do it.
First, you’ll learn how to analyze data for bias and credibility. This is very important because even the most sound data can be skewed or misinterpreted. Then we’ll learn about the importance of being good and bad. Yep, just like when we were kids. But in this case, we’ll be exploring good data sources and learning how to steer clear of their nemesis, bad data.
After that, we’ll learn more about the world of data ethics, privacy and access. As more and more data becomes available, and the algorithms we create to use this data become more complex and sophisticated, new issues keep popping up. We need to ask questions like, who owns all this data? How much control do we have over the privacy of data? Can we use and reuse data however we want to?
As a data analyst, it’s important to understand data ethics and privacy because in your work, you’ll make a lot of judgment calls on the correct use and application of data. I’m excited to walk you through some of the questions, answers, risks and rewards involved. Let’s open up the first chapter of this data story in our next video.
Video: Bias: From questions to conclusions
- Bias in everyday life
  - Bias is a preference in favor of or against a person, group of people, or thing. It can be conscious or subconscious.
  - We run into bias all the time in everyday life.
- Bias in data
  - Data bias is a type of error that systematically skews results in a certain direction.
  - Bias can happen in the way data is collected, analyzed, or presented.
- The importance of identifying and addressing bias
  - Bias can have a significant impact on the accuracy and reliability of data.
  - It is important to identify and address bias in data in order to make informed decisions.
Here are some additional details about bias in data:
- There are many different types of bias, including:
  - Selection bias: This occurs when the data is not representative of the population it is supposed to represent. For example, if you are trying to determine the average height of people in the United States, but you only survey people who are over 6 feet tall, then your estimate will be too high.
  - Measurement bias: This occurs when the way data is collected introduces errors into the data. For example, if you are trying to measure the weight of people, but you use a scale that is not calibrated correctly, then every reading will be off in the same direction.
  - Reporting bias: This occurs when people are more likely to report certain things than others. For example, if you are trying to determine how many people have been victims of crime, but people are more likely to report crimes that have been violent, then violent crime will be over-represented in your results.
- There are a number of ways to identify and address bias in data. Some common methods include:
  - Data cleaning: This involves identifying and correcting or removing errors in data.
  - Data analysis: This involves using statistical methods to identify patterns in data, such as a group that is over- or under-represented.
  - Data visualization: This involves creating charts and graphs that make skew or gaps in the data easier to see.
- By identifying and addressing bias in data, you can make more informed decisions. This is important for businesses, governments, and individuals.
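To make "systematically skews results in a certain direction" concrete, here is a small simulation of the miscalibrated scale example. It is a sketch under stated assumptions: the true weights are randomly generated rather than real data, and the `weigh` function, the 2 kg offset, and the 0.5 kg noise level are all invented for illustration.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical "true" weights in kilograms (simulated, not real data).
true_weights = [random.gauss(70, 10) for _ in range(1000)]

def weigh(true_kg, offset_kg=0.0, noise_kg=0.5):
    """Simulate one reading: the true value, plus a fixed calibration
    offset (systematic error), plus random noise (random error)."""
    return true_kg + offset_kg + random.gauss(0, noise_kg)

calibrated = [weigh(w) for w in true_weights]
miscalibrated = [weigh(w, offset_kg=2.0) for w in true_weights]

# Random noise largely cancels out in the average; the offset does not.
error_calibrated = mean(calibrated) - mean(true_weights)
error_miscalibrated = mean(miscalibrated) - mean(true_weights)

print(f"average error, calibrated scale:    {error_calibrated:+.2f} kg")
print(f"average error, miscalibrated scale: {error_miscalibrated:+.2f} kg")
```

Collecting more readings shrinks the effect of random noise, but it does nothing about the offset. That is why bias has to be addressed at the source rather than by gathering more of the same data.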
Bias: From questions to conclusions
Bias is a systematic error in the design, collection, analysis, or interpretation of data that can lead to incorrect or misleading results. Bias can be introduced at any stage of the data analysis process, from the questions that are asked to the conclusions that are drawn.
How bias can affect data analysis
Bias can affect data analysis in a number of ways. For example, bias can lead to:
- Incorrect or misleading results: If the data is biased, the results of the analysis will also be biased. This can lead to incorrect conclusions and decisions.
- A lack of generalizability: If the data is not representative of the population of interest, the results of the analysis may not be generalizable to the wider population.
- Ethical concerns: Bias can also lead to ethical concerns, such as discrimination or unfair treatment.
Examples of bias in data analysis
Here are some examples of bias in data analysis:
- Selection bias: This occurs when the data is not representative of the population of interest. For example, if you collect data from people who attend a conference, the data will be biased towards people who are interested in attending the conference and who are able to afford to attend.
- Confirmation bias: This occurs when you only look for data that confirms your existing beliefs. For example, if you believe that a certain drug is effective, you may only look for data that supports this belief.
- Outlier bias: This occurs when outliers are not properly handled. Outliers are data points that are significantly different from the rest of the data. If outliers are not properly handled, they can skew the results of the analysis.
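The effect of an unhandled outlier is easy to see with a toy example. The income figures below are made up for illustration; the point is only to compare how the mean and the median respond to one extreme value.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands; the last value is an outlier.
incomes = [42, 45, 47, 50, 52, 55, 58, 950]

mean_with = mean(incomes)
median_with = median(incomes)

# The same summary statistics with the extreme value set aside.
without_outlier = incomes[:-1]
mean_without = mean(without_outlier)
median_without = median(without_outlier)

print(f"with outlier:    mean={mean_with:.1f}, median={median_with:.1f}")
print(f"without outlier: mean={mean_without:.1f}, median={median_without:.1f}")
```

The mean moves a great deal while the median barely changes, which is one reason analysts often report the median for skewed data. Setting the value aside here is only for comparison: if the outlier is a genuine observation, dropping it would introduce a bias of its own.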
How to avoid bias in data analysis
There are a number of things you can do to avoid bias in data analysis:
- Be aware of potential sources of bias. This includes understanding the different types of bias and the different stages of the data analysis process where bias can be introduced.
- Design your study carefully. This includes using a representative sample and designing your survey questions in a way that minimizes bias.
- Use appropriate statistical methods. There are a number of statistical methods that can be used to identify and mitigate bias.
- Have your work reviewed by others. This can help to identify any potential biases in your analysis.
Conclusion
Bias is a serious problem that can affect the results of data analysis. It is important to be aware of the potential sources of bias and to take steps to mitigate them. By following the tips above, you can help to ensure that your data analysis is unbiased and that your results are reliable and valid.
Additional tips:
- Use multiple data sources. This will help to reduce the impact of any bias in any one data source.
- Be transparent about your data collection and analysis methods. This will help others to understand your work and to assess the potential for bias.
- Be open to feedback. If someone points out a potential bias in your work, be willing to consider their feedback and make changes as needed.
Let’s kick things off by traveling back in time, well, in our minds at least. My real time machine’s in the shop. Imagine you’re back in middle school and you’ve entered a project for the science fair. You worked hard for weeks perfecting every element and they’re about to announce the winners. You close your eyes, take a deep breath, and you hear them call your name for second place. Bummer, you really wanted that first-place trophy, but hey, you’ll take the ribbon for recognition.
The next day you learn the judge was the winner’s uncle. How is that fair!? Can he really be expected to choose a winner fairly when his own family member is one of the contestants? He’s probably biased! Maybe his niece deserved to win and maybe not. But the point is: it’s very easy to make a case for bias in that scenario.
This is a super-simple example, but the truth is, we run into bias all the time in everyday life. Our brains are biologically designed to streamline thinking and make quick judgments. Bias has evolved to become a preference in favor of or against a person, group of people, or thing. It can be conscious or subconscious. The good news is once we know and accept that we have bias, we can start to recognize our own patterns of thinking and learn how to manage it.
It’s important to know that bias can also find its way into the world of data. Data bias is a type of error that systematically skews results in a certain direction. Maybe the questions on a survey had a particular slant to influence answers, or maybe the sample group wasn’t truly representative of the population being studied. For example, if you’re going to take the median age of the US patient population with health insurance, you wouldn’t just use a sample of Medicare patients who are 65 and older.
Bias can also happen if a sample group lacks inclusivity. For example, people with disabilities tend to be under-identified, under-represented, or excluded in mainstream health research. The way you collect data can also bias a data set. For example, if you give people only a short time to answer questions, their responses will be rushed. When we’re rushed, we make more mistakes, which can affect the quality of our data and create biased outcomes.
As a data analyst, you have to think about bias and fairness from the moment you start collecting data to the time you present your conclusions. After all, those conclusions can have serious implications. Think about this: it’s been acknowledged that clinical studies of heart health tend to include a lot more men than women. This has led to women failing to recognize symptoms and ultimately having their heart conditions go undetected and untreated. That’s just one way bias can have a very real impact.
While we’ve come a long way in recognizing bias, it still led to you losing out to the judge’s niece at that science competition. It’s still influencing business decisions, health care choices and access, governmental action, and more. So we’ve still got work to do. Coming up, we’ll show you how to identify bias in the data itself, and explore some scenarios when you may actually benefit from it.
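The Medicare example from the transcript can be illustrated with a short simulation. The ages below are randomly generated and are not real insurance data; the uniform age range is an assumption chosen only to keep the sketch simple.

```python
import random
from statistics import median

random.seed(7)

# Hypothetical ages of insured patients (simulated, not real data).
insured_ages = [random.randint(0, 95) for _ in range(10_000)]

# A sample restricted to patients aged 65 and older.
medicare_only = [age for age in insured_ages if age >= 65]

print("median age, all insured patients:", median(insured_ages))
print("median age, 65-and-older sample: ", median(medicare_only))
```

Whatever the true age distribution looks like, a sample that only admits people aged 65 and older cannot produce a median below 65, so it will overstate the median age of the wider insured population.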
Video: Biased and unbiased data
- Bias is a preference based on preconceived notions.
- Bias can be introduced into data at any stage of the data collection, analysis, or presentation process.
- Sampling bias occurs when a sample is not representative of the population as a whole.
- One way to avoid sampling bias is to use random sampling.
- One way to identify sampling bias is to visualize the data, for example by comparing the makeup of the sample with the makeup of the population.
- There are other types of bias, such as measurement bias and reporting bias.
- It is important to be aware of bias in order to make accurate and informed decisions.
Here are some additional details about bias:
- Sampling bias occurs when a sample is not representative of the population as a whole. This can happen for a number of reasons, such as:
  - The sample is not randomly selected.
  - The sample is too small.
  - The sample is not representative of the population in terms of demographics, such as age, gender, or race.
- Measurement bias occurs when errors are introduced into the data during the measurement process. This can happen for a number of reasons, such as:
  - The measurement instrument is not calibrated correctly.
  - The measurement procedure is not followed correctly.
  - The person collecting the data is not trained properly.
- Reporting bias occurs when people are more likely to report certain things than others. This can happen for a number of reasons, such as:
  - People may be more likely to report positive things than negative things.
  - People may be more likely to report things that they think the researcher wants to hear.
  - People may be more likely to report things that they think will benefit them in some way.
- It is important to be aware of bias in order to make accurate and informed decisions. When you are aware of bias, you can take steps to mitigate its effects. For example, you can:
  - Use random sampling to select a representative sample of the population.
  - Use multiple data sources to corroborate your findings.
  - Be aware of your own biases and how they might be affecting your interpretation of the data.
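As a rough illustration of random sampling, the sketch below compares a convenience sample with a simple random sample drawn using Python's `random.sample`. The class roster is hypothetical: 50 students, of whom 15 earned an A.

```python
import random

random.seed(42)

# Hypothetical roster: students 0-14 earned an A, the rest did not.
students = [{"id": i, "grade": "A" if i < 15 else "other"} for i in range(50)]

# Convenience sample: only the students who earned an A.
convenience_sample = [s for s in students if s["grade"] == "A"]

# Simple random sample: every student has the same chance of selection.
random_sample = random.sample(students, k=15)

def share_of_a(sample):
    """Fraction of the sample that earned an A."""
    return sum(s["grade"] == "A" for s in sample) / len(sample)

print(f"A students in the class:              {share_of_a(students):.0%}")
print(f"A students in the convenience sample: {share_of_a(convenience_sample):.0%}")
print(f"A students in the random sample:      {share_of_a(random_sample):.0%}")
```

A single random sample will not match the class exactly, but it has no built-in tilt toward one group. The convenience sample is unrepresentative by construction, no matter how many times it is drawn.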
Biased and unbiased data in data analysis
Biased data
Biased data is data that is not representative of the population of interest. This can happen for a number of reasons, such as:
- Selection bias: This occurs when the data is collected from a non-representative sample of the population. For example, if you collect data from people who attend a conference, the data will be biased towards people who are interested in attending the conference and who are able to afford to attend.
- Measurement bias: This occurs when the data is collected in a way that introduces bias. For example, if you use a survey to collect data, the questions you ask and how you ask them can introduce bias.
- Reporting bias: This occurs when the data is reported in a way that is biased. For example, a news article might only report on data that supports its headline.
Unbiased data
Unbiased data is data that is representative of the population of interest and that has not been collected or reported in a way that introduces bias. This can be difficult to achieve, but there are a number of things you can do to reduce bias, such as:
- Use a representative sample. When collecting data, make sure to use a sample that is representative of the population of interest. This may involve using a stratified sampling technique or a random sampling technique.
- Use a well-designed survey. If you are using a survey to collect data, make sure that the questions are well-designed and that they do not introduce bias. You may want to pilot test your survey before using it to collect data.
- Report the data accurately. When reporting the data, make sure to report it accurately and in a way that does not introduce bias. This includes avoiding selective reporting and using appropriate statistical methods to analyze the data.
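The stratified sampling technique mentioned above can be sketched as follows. This is one simple variant (proportional allocation), and the `stratified_sample` helper, the age groups, and the population sizes are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum, so each
    group appears in the sample in proportion to its population size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[person[key]].append(person)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 1,000 people in three age groups.
population = (
    [{"id": i, "age_group": "18-34"} for i in range(0, 300)]
    + [{"id": i, "age_group": "35-64"} for i in range(300, 800)]
    + [{"id": i, "age_group": "65+"} for i in range(800, 1000)]
)

sample = stratified_sample(population, key="age_group", fraction=0.1)
counts = Counter(person["age_group"] for person in sample)
print(dict(counts))
```

Compared with a simple random sample, stratifying guarantees that smaller groups are not left out by chance. It does require knowing each person's group before sampling.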
How to identify biased data
There are a number of ways to identify biased data. One way is to look for inconsistencies in the data. For example, if you are collecting data on customer satisfaction, and you find that the satisfaction rate is much higher for customers who purchase a certain product, this may be a sign of bias in how those responses were collected. It may also be a real difference, so treat it as a prompt to investigate rather than as proof.
Another way to identify biased data is to compare it to other data sources. For example, if you are collecting data on student achievement, and you find that the achievement rates at your school are much higher than the achievement rates in the district, check whether the students in your data are representative of the school before trusting the result.
How to deal with biased data
If you identify biased data, there are a number of things you can do. One option is to discard the data and collect new data. However, this may not always be possible or practical.
Another option is to try to mitigate the bias. This may involve using statistical methods to adjust the data or using other data sources to supplement the biased data.
Finally, you can also acknowledge the bias in your analysis and be transparent about how it may affect your results.
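One statistical adjustment of the kind mentioned above is post-stratification weighting, where each group's result is weighted by its known share of the population instead of its share of the sample. The sketch below is illustrative only: the survey responses are invented, the population shares are assumed to be known from another source, and `poststratified_mean` is a hypothetical helper.

```python
from collections import defaultdict

def poststratified_mean(responses, population_share):
    """Average each group separately, then combine the group averages
    using the population shares as weights."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in responses:
        totals[group] += value
        counts[group] += 1
    return sum(population_share[g] * totals[g] / counts[g] for g in counts)

# Hypothetical survey: 1 = satisfied, 0 = not satisfied.
# Men are over-represented (80 of 100 respondents).
responses = (
    [("men", 1)] * 40 + [("men", 0)] * 40
    + [("women", 1)] * 16 + [("women", 0)] * 4
)

# Assumed true population shares, e.g. taken from census figures.
population_share = {"men": 0.5, "women": 0.5}

raw = sum(value for _, value in responses) / len(responses)
adjusted = poststratified_mean(responses, population_share)

print(f"raw satisfaction rate:      {raw:.2f}")
print(f"adjusted satisfaction rate: {adjusted:.2f}")
```

Weighting only corrects for the groups you adjust on, and it cannot help if a group is missing from the sample entirely. It complements careful sampling rather than replacing it.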
Conclusion
Biased data is a serious problem that can affect the results of data analysis. It is important to be aware of the potential sources of bias and to take steps to mitigate them. By following the tips above, you can help to ensure that your data analysis is unbiased and that your results are reliable and valid.
Additional tips:
- Be aware of your own biases. We all have biases, and it is important to be aware of them so that we can minimize their impact on our work.
- Get feedback from others. Ask someone else to review your data collection and analysis methods to identify any potential biases.
- Use common sense. If something seems too good to be true, it probably is. If you see data that seems to be too perfect or too consistent, be skeptical.
An unbiased sample is representative of the population being measured. Which of the following helps ensure unbiased sampling?
Using random sampling during data collection
Using random sampling during data collection helps ensure unbiased sampling.
There are 50 students in a class. A data analyst wants to know if a majority of students like the instructor. They decide to survey the 15 students who earned an A in the class because these students were clearly paying attention to the instructor. Which of the following statements best describes this sample?
Biased
This is a biased sample because it only includes students who earned A’s. It’s not representative of the population.
Hello again. So far we’ve learned that the biases we have as people can end up creating biased data. We’re biased when we have preferences based on our own preconceived or even subconscious notions. When data is biased, it can systematically skew results in a certain direction, making them unreliable.
We covered this earlier using sampling bias as an example. Sampling bias is when a sample isn’t representative of the population as a whole. You can avoid this by making sure the sample is chosen at random, so that all parts of the population have an equal chance of being included. If you don’t use random sampling during data collection, you end up favoring one outcome.
Here’s a simple way to look at it. Let’s say there are 50 students in one class, and you want to know if the majority of the class prefers warm or cold weather. You decide to survey the first 10 students you meet, and based on their responses, you determine that the entire class prefers warm weather. But wait, there’s some bias there. Those first 10 people were all women, so only women were included in your survey. Your survey wasn’t a fair representation of the entire class because it didn’t include other identifiers across the gender spectrum. If you’d used a more randomized sample of the population that included all genders, you’d have an unbiased sample. Unbiased sampling results in a sample that’s representative of the population being measured.
Another great way to discover if you’re working with unbiased data is to bring the results to life with visualizations. In the class example we just covered, you could visualize the number of students in the class overall, and their gender identities with a bar chart. You could then compare that to a similar bar chart showing the students you surveyed. This will help you easily identify any misalignment with your sample.
Okay, now that we know what bias looks like from a sampling perspective, let’s explore some other types of bias, and how to recognize them.
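The bar chart comparison described in the transcript can be approximated with plain text bars, which keeps the sketch free of plotting dependencies. The class composition below is an assumption made up for illustration (the transcript only says the 10 surveyed students were all women), and `shares` and `text_bar_chart` are hypothetical helpers.

```python
from collections import Counter

def shares(labels):
    """Return each label's share of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in counts}

def text_bar_chart(title, share_by_group, groups, width=20):
    """Render one bar per group, scaled so that 100% fills `width` characters."""
    lines = [title]
    for group in groups:
        share = share_by_group.get(group, 0.0)
        bar = "#" * round(share * width)
        lines.append(f"{group:<11}{bar} {share:.0%}")
    return "\n".join(lines)

# Hypothetical class of 50 students and the 10 students surveyed.
class_genders = ["woman"] * 24 + ["man"] * 22 + ["nonbinary"] * 4
surveyed = ["woman"] * 10
groups = ["woman", "man", "nonbinary"]

print(text_bar_chart("Whole class", shares(class_genders), groups))
print()
print(text_bar_chart("Surveyed students", shares(surveyed), groups))
```

Placing the two charts side by side makes the mismatch obvious: groups that exist in the class have empty bars in the sample. A plotting library would show the same thing with real bar charts.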
Video: Understanding bias in data
- Observer bias occurs when different people observe the same thing and see different things.
- Interpretation bias occurs when people interpret the same data in different ways.
- Confirmation bias occurs when people search for or interpret information in a way that confirms their preexisting beliefs.
- These three types of bias can all affect the way we collect and make sense of data.
- It is important to be aware of these biases so that we can take steps to mitigate their effects.
Here are some additional details about each type of bias:
- Observer bias can occur in any situation where two or more people are observing the same thing. For example, if two scientists are looking at the same bacteria under a microscope, they may see different things because they have different levels of experience, different expectations, or different biases.
- Interpretation bias can occur when people are trying to make sense of data. For example, if two people are looking at the same set of sales figures, they may interpret the data differently because they have different goals or different ways of thinking about sales.
- Confirmation bias can occur when people are looking for information that supports their preexisting beliefs. For example, if someone believes that a certain type of medication is effective, they may be more likely to remember information that supports that belief and less likely to remember information that contradicts it.
It is important to be aware of these biases so that we can take steps to mitigate their effects. Here are some tips for avoiding bias:
- Be aware of your own biases. The first step to avoiding bias is to be aware of your own biases. Take some time to think about your own beliefs, values, and experiences. How might these things affect the way you interpret data?
- Get multiple perspectives. When possible, try to get multiple perspectives on the data. This could involve talking to people with different backgrounds, experiences, or beliefs.
- Be open to new information. Be willing to consider new information, even if it contradicts your preexisting beliefs.
- Be critical of the data. Don’t just accept the data at face value. Ask questions about how the data was collected, analyzed, and presented.
Understanding bias in data
Bias is any systematic error in the design, collection, analysis, or interpretation of data that can lead to incorrect or misleading results. Bias can be introduced at any stage of the data analysis process, from the questions that are asked to the conclusions that are drawn.
There are many different types of bias, but some of the most common types in data analysis include:
- Selection bias: This occurs when the data is not representative of the population of interest. For example, if you collect data from people who attend a conference, the data will be biased towards people who are interested in attending the conference and who are able to afford to attend.
- Measurement bias: This occurs when the data is collected in a way that introduces bias. For example, if you use a survey to collect data, the questions you ask and how you ask them can introduce bias.
- Reporting bias: This occurs when the data is reported in a way that is biased. For example, a news article might only report on data that supports its headline.
- Confirmation bias: This occurs when you only look for data that confirms your existing beliefs. For example, if you believe that a certain drug is effective, you may only look for data that supports this belief.
- Outlier bias: This occurs when outliers are not properly handled. Outliers are data points that are significantly different from the rest of the data. If outliers are not properly handled, they can skew the results of the analysis.
Bias can have a significant impact on the results of data analysis. For example, if the data is biased, the results of the analysis will also be biased. This can lead to incorrect conclusions and decisions.
How to identify bias in data
There are a number of things you can do to identify bias in data. One way is to look for inconsistencies in the data. For example, if you are collecting data on customer satisfaction, and you find that the satisfaction rate is much higher for customers who purchase a certain product, this may be a sign of bias in how those responses were collected. It may also be a real difference, so treat it as a prompt to investigate rather than as proof.
Another way to identify bias in data is to compare it to other data sources. For example, if you are collecting data on student achievement, and you find that the achievement rates at your school are much higher than the achievement rates in the district, check whether the students in your data are representative of the school before trusting the result.
How to mitigate bias in data
Once you have identified bias in data, there are a number of things you can do to mitigate it. One option is to discard the data and collect new data. However, this may not always be possible or practical.
Another option is to try to mitigate the bias. This may involve using statistical methods to adjust the data or using other data sources to supplement the biased data.
Finally, you can also acknowledge the bias in your analysis and be transparent about how it may affect your results.
Conclusion
Understanding bias in data is essential for conducting accurate and reliable data analysis. By being aware of the different types of bias and taking steps to mitigate them, you can help to ensure that your results are trustworthy and unbiased.
I may be biased, but I think learning
about the good, and the bad traits of data,
is pretty fascinating. Next up, we’ll
discover that there’s lots of different
types of data bias, besides sampling bias,
which we covered earlier. As a quick refresher,
sampling bias, is when a sample isn’t representative of the
population as a whole. For example, if you’re doing
research on commuters, and only survey people
walking by in the sidewalk, you’ll miss out on input from
people who ride bicycles, drive, or take the subway. You need all sides of the
story to avoid sampling bias. In this video, we’ll explore three more types of data bias, observer bias,
interpretation bias, and confirmation bias, and
we’ll learn how to avoid them. Let’s start with observer bias, which is sometimes referred to as experimenter bias
or research bias. Basically, it’s the tendency for different people to observe
things differently. You might remember earlier, we learned that scientists use observations a lot in their work, like when they’re
looking at bacteria under a microscope
to gather data. While two scientists looking
into the same microscope might see different things,
that’s observer bias. Another time observer bias might happen is during manual
blood pressure readings. Because the pressure
meter is so sensitive, health care workers often get
pretty different results. Usually, they’ll just round up to the nearest whole number to compensate for the
margin of error. But if doctors
consistently round up, or down the blood pressure
readings on their patients, health conditions may be missed, and any studies
involving their patients wouldn’t have precise,
and accurate data. Another common type of data
bias is interpretation bias. The tendency to always interpret ambiguous situations
in a positive, or negative way.
Here’s an example. Let’s say you’re having
lunch with a colleague, when you get a voicemail
from your boss, asking you to call her back. You put the phone down in a huff, certain that she’s angry, and you’re on the hot
seat for something. But when you play the
message for your friend, he doesn’t hear anger at all, he actually thinks she sounds
calm and straightforward. Interpretation bias, can lead to two people seeing or hearing
the exact same thing, and interpreting it in a
variety of different ways, because they have different
backgrounds, and experiences. Your history with your boss made you interpret
the call one way, while your friend
interpreted it in another way, because
they’re strangers. Add these interpretations
to a data analysis, and you can get bias results. The last type of
bias we’ll cover, reminds me of the saying, people see what they want to see. That pretty much sums up
confirmation bias in a nutshell. Confirmation bias, is the
tendency to search for, or interpret information in a way that confirms
preexisting beliefs. Someone might be so eager
to confirm a gut feeling, that they only notice
things that support it, ignoring all other signals. This happens all the
time in everyday life. We might get our news from a certain website because the
writers share our beliefs, or we socialize
with people because we know that they
hold similar views. After all, conflicting viewpoints might cause us to
question our worldview, which can lead us to changing
our whole belief system, and let’s face, it,
change is tough. But you know what’s even tougher? Doing good work when
you have bad data, so it’s important to
keep bias out of it. The four types of
data bias we covered, sampling bias, observer bias, interpretation bias,
and confirmation bias, are all unique, but they do
have one thing in common. They each affect
the way we collect and make sense of the data. Unfortunately, they’re
also just a small sample, pun intended, of the
types of bias you may encounter in your
career as a data analyst. But the good news is,
once you know a few, you’ll find yourself
constantly on guard for bias in any form. It’s also important to remember that no matter what
kind of data you use, all of it needs to be inspected for accuracy and
trustworthiness. We’ll talk more about
that soon when we start exploring bad data. Bye for now.
Practice Quiz: Test your knowledge on unbiased and objective data
Which of the following are examples of sampling bias? Select all that apply.
A clinical study includes three times more men than women.
A survey of high-school-age students does not include homeschooled students.
A national election poll only interviews people with college degrees.
A survey of high-school-age students that does not include homeschooled students, a national election poll that only interviews people with college degrees, and a clinical study that includes three times more men than women are not representative of the population.
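One way to make the idea of a non-representative sample concrete is to compare each group's share of the sample with its share of the population. The sketch below is illustrative only: the function name `representation_gaps`, the study counts, and the 50/50 population split are assumptions made for the example, not figures from the course.

```python
def representation_gaps(sample_counts, population_shares):
    """Return (sample share - population share) for each group."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    gaps = {}
    for group, population_share in population_shares.items():
        sample_share = sample_counts.get(group, 0) / total
        gaps[group] = sample_share - population_share
    return gaps


# Hypothetical clinical study with three times more men than women.
study_counts = {"men": 300, "women": 100}
# Assumed population shares, for illustration only.
assumed_population = {"men": 0.5, "women": 0.5}

gaps = representation_gaps(study_counts, assumed_population)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%} relative to the population")
```

A gap near zero for every group suggests the sample roughly mirrors the population on that attribute. Large positive or negative gaps are a signal to look for sampling bias.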
Fill in the blank: The tendency to search for or interpret information in a way that validates pre-existing beliefs is _____ bias.
confirmation
The tendency to search for or interpret information in a way that validates pre-existing beliefs is confirmation bias.
Which of the following terms are also ways of describing observer bias? Select all that apply.
Research bias
Experimenter bias
Observer bias is sometimes referred to as experimenter bias or research bias.
Explore data credibility
Video: Identifying good data sources
To identify good data sources, follow the ROCCC acronym:
- Reliable: The data source should be trustworthy and provide accurate, complete, and unbiased information.
- Original: Get the data from the original source whenever possible.
- Comprehensive: The data source should contain all critical information needed to answer the question or find the solution.
- Current: The data should be up-to-date and relevant to the task at hand.
- Cited: The data source should be properly cited, so that you can assess its credibility and reliability.
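The ROCCC checklist above can be recorded as a small data structure so that each source you review gets the same five questions. This is a minimal sketch: the class name `RocccCheck` and its methods are hypothetical, and the true/false answers still come from your own judgment of the source.

```python
from dataclasses import dataclass, fields


@dataclass
class RocccCheck:
    """One reviewer's answers to the five ROCCC questions for a data source."""
    reliable: bool
    original: bool
    comprehensive: bool
    current: bool
    cited: bool

    def failed(self):
        """Names of the criteria the source did not meet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passes(self):
        """True only when every criterion is met."""
        return not self.failed()


# Hypothetical review of a dataset found through a third-party site.
check = RocccCheck(reliable=True, original=False, comprehensive=True,
                   current=True, cited=True)
print(check.passes())
print(check.failed())
```

Listing the failed criteria, rather than returning only a yes or no, shows what to follow up on. Here, that means validating the data with the original source.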
Some good data sources include vetted public data sets, academic papers, financial data, and governmental agency data.
Bad data can be inaccurate, incomplete, biased, outdated, or uncited. To avoid bad data, be critical of the data source and the data itself. Ask yourself who created the data, why it was created, and how it was collected. Also, look for evidence of bias and make sure the data is up-to-date.
Here are some additional tips for finding good data sources:
- Look for data sources that are created by reputable organizations.
- Check the data source’s documentation to see how the data was collected and processed.
- Look for data that is updated regularly.
- Be aware of potential biases in the data.
- Cross-check the data with other sources.
By following these tips, you can find good data sources that will help you make informed decisions.
Identifying good data sources in data analysis
Good data is essential for good data analysis. If you start with bad data, your results will be unreliable and misleading. That’s why it’s important to be able to identify good data sources.
Here are some tips for identifying good data sources:
- Consider the source. Who created the data? What is their reputation? Are they experts in the field? Do they have a vested interest in the outcome of your analysis?
- Assess the data collection methods. How was the data collected? Was it collected using rigorous scientific methods? Was it collected from a representative sample?
- Evaluate the data quality. Is the data accurate? Is it complete? Is it consistent? Is it free of errors?
- Check the data for bias. Is the data biased in any way? For example, was it collected only from a certain demographic group?
- Consider the timeliness of the data. Is the data up-to-date? Is it relevant to your analysis?
Here are some additional tips:
- Look for data sources that are well-documented. The documentation should explain how the data was collected and processed.
- Look for data sources that are widely used and cited by other researchers.
- Be wary of data sources that are too good to be true. If a data source seems too perfect, it probably is.
Here are some examples of good data sources:
- Government agencies
- Universities and research institutions
- Reputable businesses and organizations
- Peer-reviewed journals and academic papers
Once you have identified some potential data sources, it’s important to evaluate them carefully to make sure that they are appropriate for your needs. Consider the factors listed above, and choose the data sources that are most reliable, accurate, and relevant.
Here are some additional tips for evaluating data sources:
- Compare data from different sources. If the data from different sources is consistent, then you can be more confident in its accuracy.
- Look for data sources that have been quality-checked. Some data sources are audited or reviewed by experts to ensure their quality.
- Consider the cost of the data. Some data sources are free, while others require a subscription fee. Choose a data source that is affordable and meets your needs.
By following these tips, you can identify and evaluate good data sources for your data analysis projects.
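Comparing data from different sources can be as simple as lining up the values both sources report and flagging the ones that disagree by more than a tolerance. The sketch below assumes two hypothetical sources keyed by year and a 1% relative tolerance. All names and figures are made up for illustration.

```python
import math


def find_disagreements(source_a, source_b, rel_tol=0.01):
    """Return the shared keys whose values differ by more than rel_tol."""
    shared_keys = source_a.keys() & source_b.keys()
    return sorted(
        key for key in shared_keys
        if not math.isclose(source_a[key], source_b[key], rel_tol=rel_tol)
    )


# Hypothetical yearly totals from an agency and from a third-party site.
agency = {"2021": 1000.0, "2022": 1100.0, "2023": 1210.0}
third_party = {"2021": 1002.0, "2022": 1100.0, "2023": 1350.0}

mismatches = find_disagreements(agency, third_party)
print(mismatches)
```

Agreement between sources raises confidence but does not prove accuracy, since both may draw on the same original data.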
Hey, what’s good!? No, really, I want to know: What is good? Let me put it this way. If I asked you to
name a good song, I might not like it. That’s because good is subjective. What I think is good
and what you think is good might be different. So what about good data sources? Are those subjective, too? In some ways they
are, but luckily, there’s some best practices
to follow that’ll help you measure the reliability of
data sets before you use them. That’s what we’ll
discuss in this video. I think we can all agree
that we all want good data. The more high quality
data we have, the more confidence we can
have in our decisions. Let’s learn how we can
go about finding and identifying good data sources. First things first, we need to learn how to identify them, using a process I like to
call ROCCC: R-O-C-C-C. Okay, I just made that up, but I think acronyms
are a really great way to help new information
to stick in the brain. Kicking things off is
R for reliable. Like a good friend, good
data sources are reliable. With this data you can trust that you’re
getting accurate, complete and unbiased information that’s been vetted and
proven fit for use. Okay. Onto O. O is for original. There’s a good chance
you’ll discover data through a second or
third party source. To make sure you’re
dealing with good data, be sure to validate it
with the original source. Time for the first C. C
is for comprehensive. The best data sources contain all critical
information needed to answer the question
or find the solution. Think about it like this. You wouldn’t want to work for
a company just because you found one great online
review about it. You’d research every aspect of the organization to make
sure it was the right fit. It’s important to do the
same for your data analysis. The next C is for current. The usefulness of data
decreases as time passes. If you wanted to invite all current clients
to a business event, you wouldn’t use a
10-year-old client list. The same goes for data. The best data sources are current and relevant to the task at hand. The last C is for cited. If you’ve ever told a
friend where you heard that a new movie sequel
was in the works, you’ve cited a source. Citing makes the information you’re providing more credible. When you’re choosing
a data source, think about three things. Who created the data set? Is it part of a
credible organization? When was the data last refreshed? If you have original data from a reliable organization
and it’s comprehensive, current, and cited, it ROCCCs! There’s lots of places that are known for having good data. Your best bet is to go with
the vetted public data sets, academic papers, financial data, and governmental agency data. Now that you know how to spot
the good data, which ROCCCs, you’re ready to learn
about the mountain of bad data and how to avoid
it. Let’s get moving.
Video: What is “bad” data?
Bad data sources
Bad data sources are those that are not reliable, original, comprehensive, current, or cited. They can be inaccurate, incomplete, biased, outdated, or uncited.
Here are some examples of bad data sources:
- Websites and blogs with no author information or references
- Opinion polls with small sample sizes or biased questions
- Social media posts
- Anecdotal evidence
- Data that is out of date or irrelevant
- Data that is not cited properly
How to avoid bad data sources
- Be critical of the data source. Consider who created the data, why it was created, and how it was collected.
- Look for data sources that are created by reputable organizations.
- Check the data source’s documentation to see how the data was collected and processed.
- Look for data that is updated regularly.
- Be aware of potential biases in the data.
- Cross-check the data with other sources.
Conclusion
It is important for data analysts to understand and be able to identify bad data sources. Bad data can lead to incorrect conclusions and poor decision-making.
Some good data sources include vetted public data sets, academic papers, financial data, and governmental agency data.
What is “bad” data in Data analysis?
Bad data is data that is inaccurate, incomplete, biased, outdated, or uncited. It can lead to incorrect conclusions and poor decision-making.
Here are some examples of bad data:
- Data that contains errors, such as typos or miscalculations
- Data that is missing important information
- Data that is biased, such as data that only represents a certain subgroup of the population
- Data that is outdated and no longer relevant
- Data that is not cited properly, so it is difficult to assess its credibility
How to identify bad data
There are a few things you can look for to identify bad data:
- Errors: Check the data for typos, miscalculations, and other errors.
- Missing information: Look for missing values in the data.
- Bias: Consider the source of the data and whether it is likely to be biased.
- Outdated data: Check the date of the data to make sure it is still relevant.
- Citation: Make sure the data is properly cited so that you can assess its credibility.
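Some of the checks in this list, such as missing information, outdated records, and a missing source, can be automated with a few lines of code. The sketch below is one possible approach: the field names, the one-year age limit, and the sample records are all assumptions chosen for illustration.

```python
from datetime import date


def audit_records(records, required_fields, as_of, max_age_days=365):
    """Return (row_index, problem) pairs for simple data quality issues."""
    problems = []
    for index, row in enumerate(records):
        # Missing information: a required field is absent or empty.
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append((index, f"missing {field}"))
        # Outdated data: the record was last refreshed too long ago.
        updated = row.get("last_updated")
        if updated is not None and (as_of - updated).days > max_age_days:
            problems.append((index, "outdated"))
    return problems


# Hypothetical client list.
records = [
    {"client": "Acme", "email": "ops@acme.example",
     "last_updated": date(2024, 3, 1), "source": "CRM export"},
    {"client": "Birch", "email": "",
     "last_updated": date(2014, 6, 30), "source": "CRM export"},
    {"client": "Cedar", "email": "hi@cedar.example",
     "last_updated": date(2024, 1, 15), "source": None},
]

problems = audit_records(records, ["client", "email", "source"],
                         as_of=date(2024, 6, 1))
for index, problem in problems:
    print(f"row {index}: {problem}")
```

Checks like these catch mechanical problems only. Judging bias and credibility still requires looking at who collected the data and how.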
How to avoid bad data
There are a few things you can do to avoid bad data:
- Use reputable sources: Get data from reputable sources, such as government agencies, universities, and peer-reviewed journals.
- Be aware of bias: Be aware of the potential for bias in the data and take steps to mitigate it.
- Clean the data: Clean the data to correct errors and handle missing values.
- Keep the data up-to-date: Keep the data up-to-date by refreshing it regularly.
Conclusion
It is important for data analysts to be able to identify and avoid bad data. Bad data can lead to incorrect conclusions and poor decision-making. By following the tips above, you can improve the quality of your data analysis and make better decisions.
Here are some additional tips for avoiding bad data:
- Cross-check the data with other sources.
- Use data visualization tools to identify anomalies and outliers in the data.
- Use statistical methods to test the data for bias and other problems.
By taking these steps, you can help to ensure that your data is reliable and accurate.
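As one example of a statistical check for anomalies, the sketch below flags values that fall far outside the middle half of the data, using the interquartile range (IQR). The 1.5 × IQR cutoff is a common convention rather than something prescribed by the course, and the readings are made-up values.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < low or value > high]


# Hypothetical systolic blood pressure readings; one looks like an entry error.
readings = [118, 120, 121, 119, 122, 120, 117, 210]
print(iqr_outliers(readings))
```

A flagged value is not automatically wrong. It is a prompt to check the original source before deciding whether to correct, keep, or exclude it.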
Welcome back. Last time we met, we learned
how to identify and find good data sources, a process
I ended up calling ROCCC. We found that if the data set is reliable,
original, comprehensive, current and cited, it
ROCCCs (or more seriously: it’s good). Hopefully this is refreshing your memory. Now it’s time to pull from what
we learned about good data and apply it to today’s lesson: bad data sources that don’t ROCCC. They’re not reliable, original,
comprehensive, current or cited. Even worse, they could be flat-out
wrong or filled with human error. We’ll start again with R. R is for not reliable. Bad data can’t be trusted because
it’s inaccurate, incomplete, or biased. This could be data that has sample
selection bias because it doesn’t reflect the overall population. Or it could be data visualizations and
graphs that are just misleading. Check out these two bar graphs, for example. The one on the left uses a
y-axis starting point of 3.14%, and the one on the right uses 0. The truncated axis on the left makes it seem like interest
rates have skyrocketed over a four-year period when they’ve actually
remained pretty flat. Okay, onto O. O is for not original. If you can’t locate the original data
source and you’re just relying on second or third party information, that can signal
you may need to be extra careful in understanding your data. Now, C is for not comprehensive. Bad data sources are missing important
information needed to answer the question or find the solution. What’s worse, they may
contain human error, too. The next C is for not current. Bad data
sources are out of date and irrelevant. Many respected sources refresh their data
regularly, giving you confidence that it’s the most current info available. For example,
you can always trust Data.gov, which is home to the U.S.
government’s open data. The last C is for not cited. If your source hasn’t
been cited or vetted, it’s a no-go. So to sum up, good data should be original data
from a reliable organization, comprehensive, current, and cited. It should ROCCC! Otherwise, it’s bad data. If you need a great reliable data source,
check out the U.S. Census Bureau, which regularly updates
their information. It’s important for data analysts to
understand and keep an eye out for bad data because it can have
serious and lasting impacts. Whether it’s an incorrect conclusion
leading to one bad business decision, or inaccurate information causing processes
to fail and putting populations at risk, every good solution is
found by avoiding bad data. For good data,
stick with vetted public data sets, academic papers, financial data and
governmental agency data. And with that, we’ve come to the end of
our adventure with bias and credibility. After a few more exercises,
you’ll be ready for what lies ahead. I look forward to your progress.
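The bar graph example from this video can be checked with simple arithmetic. A bar's drawn height is its value minus the y-axis starting point, so a truncated axis exaggerates the ratio between bars. In the sketch below, only the 3.14% starting point comes from the video. The two interest rates are hypothetical values chosen for illustration.

```python
def apparent_ratio(first_value, last_value, axis_start):
    """How many times taller the last bar is drawn than the first bar."""
    return (last_value - axis_start) / (first_value - axis_start)


# Hypothetical interest rates (in percent) at the start and end of the period.
first_rate, last_rate = 3.15, 3.20

truncated = apparent_ratio(first_rate, last_rate, axis_start=3.14)
zero_based = apparent_ratio(first_rate, last_rate, axis_start=0.0)

print(f"Axis starting at 3.14%: last bar looks about {truncated:.1f}x taller")
print(f"Axis starting at 0: last bar looks about {zero_based:.2f}x taller")
```

With these made-up rates, the same change looks like roughly a sixfold jump on the truncated axis and under a 2% difference on the zero-based axis.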
Practice Quiz: Test your knowledge on data credibility
Which of the following are usually good data sources? Select all that apply.
Vetted public datasets
Academic papers
Governmental agency data
Vetted public datasets, academic papers, and governmental agency data are usually good data sources.
To determine if a data source is cited, you should ask which of the following questions? Select all that apply.
Who created this dataset?
Is this dataset from a credible organization?
“Is this dataset from a credible organization?” and “Who created this dataset?” are questions that can help you determine if a data source is cited.
A data analyst is analyzing sales data for the newest version of a product. They use third-party data about an older version of the product. For what reasons is this inappropriate for their analysis? Select all that apply.
The data is not current
The data is not original
Third-party data about an older version of the product is inappropriate because it is not original or current.
Data ethics and privacy
Video: Introduction to data ethics
Data ethics is a branch of ethics that deals with the ethical implications of data collection, storage, processing, and use. It is a new and evolving field, as the ability to collect, share, and use data in large quantities is relatively new.
There are six different aspects of data ethics:
- Ownership: Individuals own the raw data they provide, and they have primary control over its usage, how it’s processed, and how it’s shared.
- Transaction transparency: All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
- Consent: Individuals have the right to know explicit details about how and why their data will be used before agreeing to provide it.
- Currency: Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
- Privacy: Individuals have the right to control who has access to their personal data and how it is used.
- Openness: Individuals have the right to know how their data is being used and to challenge or change the way it is being used.
Data ethics is important because it helps to protect individuals from the potential harms of data use, such as discrimination, privacy violations, and financial exploitation. It also helps to ensure that data is used in a fair and responsible way.
The video also highlights the importance of data privacy and openness, which will be discussed in more detail in future videos.
Introduction to Data Ethics in Data Analysis
Data ethics is important in data analysis because it helps to ensure that data is used in a fair, responsible, and ethical way. This is important for protecting individuals from the potential harms of data use, such as discrimination, privacy violations, and financial exploitation.
Here are some key principles of data ethics that data analysts should be aware of:
- Transparency: Data analysts should be transparent about how they collect, store, process, and use data. This includes informing individuals about what data is being collected, how it will be used, and how long it will be stored.
- Consent: Data analysts should obtain consent from individuals before collecting or using their data. Consent should be informed and freely given.
- Fairness: Data analysts should use data in a fair and unbiased way. This means avoiding data sets and algorithms that are biased against certain groups of people.
- Privacy: Data analysts should protect the privacy of individuals. This means taking steps to ensure that data is not accessed or used without authorization.
- Accountability: Data analysts should be accountable for the way they use data. This means being responsible for the consequences of data use, both positive and negative.
Here are some specific things that data analysts can do to promote data ethics:
- Use clear and concise language when communicating with data subjects about data collection and use. Avoid using jargon or technical terms that data subjects may not understand.
- Provide data subjects with the opportunity to opt out of data collection or use. This can be done by providing a clear and easy-to-use unsubscribe process.
- Use data sets and algorithms that have been vetted for bias. There are a number of organizations that provide resources for identifying and mitigating bias in data.
- Implement appropriate security measures to protect data from unauthorized access and use. This includes using encryption, strong passwords, and access controls.
- Regularly review data collection and use practices to ensure that they are aligned with data ethics principles.
Data ethics is a complex and nuanced topic, but it is important for data analysts to be aware of the key principles and practices. By following these guidelines, data analysts can help to ensure that data is used in a way that is fair, responsible, and ethical.
Hi again, let me
ask you something. What comes to your mind when you think of the word, ethics? For me, it’s a set of
principles to live by. Most people have a
personal code of ethics that helps them
navigate the world. When we’re young, it
could be as simple as never lie, cheat or steal, but as we get older, it’s a much broader list
of do’s and don’ts. Our personal ethics evolve
and become more rational, giving us a moral compass
to use as we face life’s questions, challenges,
and opportunities. When we analyze data, we’re also faced with
questions, challenges, and opportunities, but
we have to rely on more than just our personal code
of ethics to address them. As we learned earlier, we all have our own
personal biases, not to mention
subconscious biases that make ethics even more
difficult to navigate. That’s why we have data ethics, an important aspect of analytics that we’ll explore right
here in this video. But first, let’s go back to
the general idea of ethics. While an exact definition is still under discussion
in philosophy, one practical view is
that ethics refers to well-founded standards
of right and wrong that prescribe
what humans ought to do, usually in terms of
rights, obligations, benefits to society, fairness
or specific virtues. Just like humans, data has standards to live up to as well. Data ethics refers to well-
founded standards of right and wrong that dictate how data is collected, shared, and used. Since the ability to collect, share and use data in such large quantities
is relatively new, the rules that
regulate and govern the process are still evolving. The importance of
data privacy has been recognized by governments
worldwide and they started creating data
protection legislation to help protect people
and their data. The GDPR of the European Union was created to do just this. While policy makers
continue their work, companies like Google have a responsibility to lead the
effort and will do so in the same spirit we
always have by offering products that make privacy
a reality for everyone. The concept of data ethics
and issues related to transparency and privacy
are part of the process. Data ethics tries to
get to the root of the accountability
companies have in protecting and responsibly
using the data they collect. There are lots of
different aspects of data ethics but we’ll
cover six: ownership, transaction transparency, consent, currency,
privacy, and openness. We’ll explore data privacy
and openness a bit later. First up is ownership. This answers the
question who owns data? It isn’t the organization that invested time and
money collecting, storing, processing,
and analyzing it. It’s individuals who own
the raw data they provide, and they have primary
control over its usage, how it’s processed
and how it’s shared. Next, we have transaction
transparency, which is the idea that all data processing activities
and algorithms should be completely explainable and understood by the individual
who provides their data. This is in response to
concerns over data bias, which, as we discussed earlier, is a type of error that systematically skews results
in a certain direction. Biased outcomes can lead to
negative consequences. To avoid them, it’s
helpful to provide transparent analysis
especially to the people who share their data. This lets people judge whether
the outcome is fair and unbiased and allows them to
raise potential concerns. Now let’s talk about
another aspect of data ethics, consent. This is an individual’s right to know explicit details about how and why their data will be used before agreeing
to provide it. They should know
answers to questions like why is the data
being collected? How will it be used? How long will it be stored? The best way to give consent
is probably a conversation between the person providing the data and the
person requesting it. But with so much activity
happening online these days, consent usually just
looks like a terms and conditions checkbox with
links to more details. Let’s face it, not everyone clicks through to
read those details. Consent is important because it prevents all populations
from being unfairly targeted which is a very big
deal for marginalized groups who are often disproportionately misrepresented by biased data. Next, there’s currency. Individuals should be aware of financial
transactions resulting from the use of
their personal data and the scale of
these transactions. If your data is helping to
fund a company’s efforts, you should know what
those efforts are all about and be given the
opportunity to opt out. The last two aspects
of data ethics, privacy and openness, deserve their own spotlight
on this data stage. Coming up, you’ll see why.
Video: Optional Refresher: Alex: The importance of data ethics
Alex and his team at Google are concerned about how AI interacts with society and how it might affect marginalized communities. They believe that data ethics is about using data in a way that is good and right, and that benefits people.
Alex raises some important questions about data ethics, such as:
- Who is collecting the data?
- Why are they collecting it?
- How are they collecting it?
- For what purpose?
He also emphasizes the importance of keeping in mind how data collection and use will benefit people, and how it will affect the people represented in the data.
Alex also discusses the importance of protecting data privacy and giving users more control over their data. He believes that people should be able to consent to giving their data, and that they should be able to ask for their data to be revoked or removed.
Alex concludes by saying that data is growing, and that these issues are becoming more and more important to think about.
Additional thoughts:
Alex’s video provides a good overview of some of the key issues in data ethics. It is important to remember that data ethics is a complex field, and there are no easy answers to many of the questions that Alex raises. However, it is important to be aware of these issues and to think carefully about the ethical implications of our work with data.
Here are some additional thoughts on the issues raised in the video:
- Transparency: It is important to be transparent about how data is collected, used, and stored. This includes informing people about what data is being collected, how it will be used, and how long it will be stored.
- Consent: People should be able to consent to giving their data. This means that they should be informed about how their data will be used, and they should have the option to opt out of data collection.
- Fairness: Data should be used in a fair and unbiased way. This means avoiding data sets and algorithms that are biased against certain groups of people.
- Privacy: People’s privacy should be protected. This means taking steps to ensure that data is not accessed or used without authorization.
- Accountability: Data scientists and analysts should be accountable for the way they use data. This means being responsible for the consequences of data use, both positive and negative.
By following these principles, we can help to ensure that data is used in a way that is ethical and beneficial to society.
Hi, I’m Alex. I’m a research
scientist at Google. My team is called
the Ethical AI Team. We’re a group of folks that
really are concerned not only about how AI, the
technology operates, but how it interacts
with society and how it might help or harm
marginalized communities. So when we talk about data ethics, we think about, What is the good and right
way of using data? What are going to be ways that are going to be uses of data that are going to be
beneficial to people? When it comes to
data ethics it’s not just about minimizing harm, but it’s actually this
concept of beneficence. How do we actually
improve the lives of people by using data? When we think about
data ethics we’re thinking about who’s
collecting the data? Why are they collecting it? How are they collecting
it? And for what purpose? Because of the way that organizations have
imperatives to make money, or to report to somebody,
or provide some analysis, we also have to keep
strongly in mind how this is actually going to benefit people at
the end of the day. Are the people represented in this data going to be
benefited by this? I think that’s the
thing you never want to lose sight of as a data
scientist or a data analyst. I think aspiring
data analysts need to keep in mind that a lot of the data that you’re going to encounter is data that
comes from people. So at the end of the day, data are people. And
you want to have a responsibility to those people that are represented
in those data. Second, is thinking
about how to keep aspects of their data
protected and private. We don’t want to go through
our practice thinking about data instances
as something we can just throw on the web. No, there needs to be
considerations about how to keep that information and likenesses, like their images, or their voices or their text. How do we keep that private? We also need to think
about how we can have mechanisms of giving users and giving consumers more
control over their data. It’s not going to be
sufficient just to say, we collected all this data and trust us with all these data, but we need to
ensure that there’s actionable ways in
which people can consent to giving those
data and ways that they can ask for it to
be revoked or removed. Data’s growing, and
at the same time, we need to empower people to have control over their own data. The future is that data
is always growing. We haven’t seen any kind of evidence that data is actually shrinking. And with the knowledge
that data’s growing, these issues become more and more piqued and more and more
important to think about.
Video: Introduction to data privacy
Data privacy is the right of individuals to control how their personal data is collected, used, and shared. It is important because it protects people from unauthorized access to their data, inappropriate use of their data, and legal harm.
Companies have a responsibility to put privacy measures in place to protect the data of their customers and employees. This includes collecting only the data that is necessary for a specific purpose, obtaining consent before using data, and taking steps to secure data from unauthorized access and use.
Data privacy is a complex issue, and there are many different perspectives on what it means and how it should be protected. However, it is important for everyone to be aware of their data privacy rights and to take steps to protect their own data.
Additional thoughts:
Data privacy is becoming increasingly important in today’s digital world. As we share more and more data online, it is important to be aware of the risks and to take steps to protect ourselves.
Here are some tips for protecting your data privacy:
- Be careful about what information you share online. Only share information with websites and apps that you trust.
- Read the privacy policies of websites and apps before you use them. This will help you to understand how your data will be used.
- Use strong passwords and enable two-factor authentication on your online accounts.
- Be careful about clicking on links in emails and text messages. Phishing attacks are a common way for scammers to steal personal information.
- Keep your software up to date. Software updates often include security patches that can help to protect your data from hackers.
By following these tips, you can help to protect your data privacy and reduce the risk of identity theft and other scams.
Introduction to Data Privacy in Data Analysis
Data privacy is the right of individuals to control how their personal data is collected, used, and shared. It is important for data analysts to be aware of data privacy principles and practices in order to protect the privacy of the individuals whose data they are analyzing.
Here are some key principles of data privacy that data analysts should be aware of:
- Transparency: Data analysts should be transparent about how they collect, use, and store data. This includes informing individuals about what data is being collected, how it will be used, and how long it will be stored.
- Consent: Data analysts should obtain consent from individuals before collecting or using their data. Consent should be informed and freely given.
- Fairness: Data analysts should use data in a fair and unbiased way. This means avoiding data sets and algorithms that are biased against certain groups of people.
- Privacy: Data analysts should protect the privacy of individuals. This means taking steps to ensure that data is not accessed or used without authorization.
- Accountability: Data analysts should be accountable for the way they use data. This means being responsible for the consequences of data use, both positive and negative.
Here are some specific things that data analysts can do to promote data privacy:
- Only collect the data that is necessary for the specific purpose of the analysis.
- Anonymize or de-identify data whenever possible.
- Use secure methods to store and transmit data.
- Implement access controls to restrict access to data.
- Dispose of data properly when it is no longer needed.
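The first practice in the list above, collecting only the data the analysis needs, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete privacy control: the field names (`name`, `email`, `age_group`, `visits`) and the `minimize` helper are hypothetical, and dropping direct identifiers does not by itself guarantee that people cannot be re-identified from the remaining fields.

```python
# Hypothetical raw records; all field names and values are invented for illustration.
raw_records = [
    {"name": "Ana Lopez", "email": "ana@example.com", "age_group": "30-39", "visits": 4},
    {"name": "Ben Wu", "email": "ben@example.com", "age_group": "20-29", "visits": 7},
]

# Assumption: the analysis only needs these two fields.
NEEDED_FIELDS = ("age_group", "visits")


def minimize(records, needed_fields):
    """Return copies of the records that keep only the needed fields."""
    return [{field: record[field] for field in needed_fields} for record in records]


analysis_records = minimize(raw_records, NEEDED_FIELDS)
print(analysis_records)
```

Working from `analysis_records` rather than `raw_records` means names and email addresses never enter the analysis in the first place.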
Data analysts also play an important role in educating the public about data privacy. They can do this by writing blog posts, giving talks, and creating other educational resources.
Here are some additional tips for data analysts who want to protect data privacy:
- Be aware of the privacy laws and regulations that apply to your work.
- Conduct a privacy impact assessment before starting a new data analysis project.
- Get approval from a data privacy officer before collecting or using sensitive data.
- Work with a team of experts, including legal and information security professionals, to ensure that your data privacy practices are sound.
By following these tips, data analysts can help to protect the privacy of the individuals whose data they are analyzing and build trust with the public.
We’ve been exploring some important aspects of data ethics, and one of the most personal areas involves privacy. Privacy is personal. We may all define privacy in our own way, and we’re all entitled to it. Whether it’s family members wanting privacy when using a shared computer, a teenager wanting to share a selfie with only specific people, or a company wanting to keep their customers’ credit card info secure, we’re all concerned how our data is used and shared. Data privacy is big in today’s culture, so let’s explore it fully.
When talking about data, privacy means preserving a data subject’s information and activity any time a data transaction occurs. This is sometimes called information privacy or data protection. It’s all about access, use, and collection of data. It also covers a person’s legal right to their data. This means someone like you or me should have protection from unauthorized access to our private data, freedom from inappropriate use of our data, the right to inspect, update, or correct our data, ability to give consent to use our data, and legal right to access our data. For companies, it means putting privacy measures in place to protect the individuals’ data.
Data privacy is important, even if you’re not someone who thinks about it on a day-to-day basis. The importance of data privacy has been recognized by governments worldwide, and they’ve started creating data protection legislation to help protect people and their data. Being able to trust companies with your data is important. It’s what makes people want to use a company’s product, share their information, and more. Trust is a really big responsibility that can’t be taken lightly.
The final aspect involving data ethics is one that’s constantly being discussed. The idea of openness, free access, usage, and sharing of data. We’ll cover that in another video. You’re well on your way to becoming an ethical data analyst.
Reading: Data anonymization
What is data anonymization?
You have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person’s identity.
Data anonymization is the process of protecting people’s private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
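The three techniques named above (blanking, hashing, and masking) can be sketched with Python’s standard library. This is an illustrative sketch only, not a production anonymizer: the record, the `SALT` placeholder, and the helper names `blank`, `hash_value`, and `mask` are all assumptions made for the example. Note also that hashing a value that is easy to guess, such as a phone number, may still allow it to be recovered by trying candidates, so real systems usually need stronger safeguards.

```python
import hashlib

# Assumption: a secret salt stored outside the dataset; this value is a placeholder.
SALT = b"replace-with-a-secret-value"


def blank(value):
    """Blanking: remove the value entirely."""
    return ""


def hash_value(value, salt=SALT):
    """Hashing: replace the value with a fixed-length code (a SHA-256 hex digest)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def mask(value, visible=4, mask_char="*"):
    """Masking: hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


# Hypothetical record; the values are invented.
record = {"name": "Ana Lopez", "email": "ana@example.com", "account_number": "1234567890"}

anonymized = {
    "name": blank(record["name"]),
    "email": hash_value(record["email"]),
    "account_number": mask(record["account_number"]),
}
print(anonymized)
```

Because the same input always produces the same hash, the hashed email can still be used to group or join rows without exposing the address itself.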
Your role in data anonymization
Organizations have a responsibility to protect their data and the personal information that data might contain. As a data analyst, you might be expected to understand what data needs to be anonymized, but you generally wouldn’t be responsible for the data anonymization itself. A rare exception might be if you work with a copy of the data for testing or development purposes. In this case, you could be required to anonymize the data before you work with it.
What types of data should be anonymized?
Healthcare and financial data are two of the most sensitive types of data. These industries rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized:
- Telephone numbers
- Names
- License plates and license numbers
- Social security numbers
- IP addresses
- Medical records
- Email addresses
- Photographs
- Account numbers
For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized. Imagine a world where we all had access to each other’s addresses, account numbers, and other identifiable information. That would invade a lot of people’s privacy and make the world less safe. Data anonymization is one of the ways we can keep data private and secure!
Video: Andrew: The ethical use of data
Andrew believes that it is important to use AI responsibly to avoid amplifying or reinforcing unfair biases. He is concerned about the potential for AI systems to harm underrepresented communities and minority groups.
Andrew’s role is to help the larger community build socially responsible AI systems. He does this by working with research groups and product teams at Google, and by engaging with the larger community. He believes that it is important to educate people about the responsible use of AI, even if they are well-intentioned but may not have the resources or knowledge to do it on their own.
Andrew believes that AI has the potential to improve the lives of many people, and that it is important to use it responsibly. He sees his role as helping to democratize the responsible use of AI so that everyone can benefit from it.
Additional thoughts:
Andrew’s video is a good reminder that AI is a powerful tool that can be used for good or for bad. It is important to be aware of the potential risks of AI, and to take steps to mitigate those risks.
One way to do this is to ensure that AI systems are transparent and accountable. This means that people should be able to understand how AI systems work and to challenge the decisions that they make.
Another important way to ensure the responsible use of AI is to involve everyone in the conversation. This includes people from all walks of life, not just technologists. By working together, we can create AI systems that are beneficial to everyone.
My name is Andrew. I am a senior developer advocate on the ethical AI research group at Google. As a senior developer advocate, I try and help the larger community build socially responsible AI systems.
One consequence of not using this technology responsibly is the possibility of amplifying or reinforcing unfair biases. Now these algorithms, these data sets, are often being used in settings where they are deciding the outcome. Whether it’s curating content for an individual or determining whether or not they’re eligible for a loan, all these different decision making processes depend on the algorithms and the data sets that are being used in that context. And so if this were to be handled irresponsibly, then the very outcomes of these systems could potentially harm underrepresented communities, minority groups.
There’s a lot that the field, the industry, the community, is learning about the responsible use of data and AI. So what I try to do is I try to correlate all those different elements, whether it’s working with various research groups within Google, working with various product teams at Google, engaging with the larger community. We have to go above and beyond and actually educate those that are striving to build this technology for good but may not necessarily have the resources or the institutional community wisdom to actually carry out their good intentions.
So the truth of the matter is that AI, data, and any technology that’s built around that, there’s a lot of great benefits to that. It’s improving the lives of many people out there. It’s enabling us to do things we couldn’t ordinarily do. It’s providing us with affordances to think about other things in life. This is all the more reason why it’s important that we together, collectively, not just one organization, but the entire community, and even non-technologists, too, everyone needs to be involved. That’s the role of, that I play here, is that I try to help AI evolve ethically together, and to do that is contingent on the democratization of the responsible use of AI.
Practice Quiz: Test your knowledge on data ethics and privacy
Fill in the blank: _____ states that all data-processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
Transaction transparency
Transaction transparency states that all data-processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
A data analyst removes personally identifying information from a dataset. What task are they performing?
Data anonymization
They are performing data anonymization, which is the process of protecting people’s private or sensitive data by eliminating identifying information.
Before completing a survey, an individual acknowledges reading information about how and why the data they provide will be used. What is this concept called?
Consent
This concept is called consent. Consent is the aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it.
Understanding open data
Video: Features of open data
Open data is data that is freely available to anyone to use, reuse, and redistribute. Openness is one aspect of data ethics, and it does not replace the others: open data still requires transparency, respect for privacy, and consent for data that is owned by others.
Open data has many benefits, including:
- It allows credible databases to be used more widely.
- It can be leveraged, shared, and combined with other data to improve scientific collaboration, research advances, analytical capacity, and decision-making.
- It can help to hold leaders accountable and provide better access to community services.
However, there are also challenges to open data, such as the need for interoperability and the resources required to make the technological shift to open data.
Despite the challenges, open data has the potential to transform society and how decisions are made.
Open data is data that is freely available to anyone to use, reuse, and redistribute. It is a valuable resource for data analysts because it can be used to create credible databases, leverage shared data, and make better decisions.
Here are some of the key features of open data that make it so useful for data analysis:
- Transparency: Open data is typically published with clear and complete metadata, which allows data analysts to understand the data and its provenance. This transparency is essential for ensuring the accuracy and reliability of the data.
- Reusability: Open data is typically published under open licenses, which allow data analysts to reuse and redistribute the data without restrictions. This flexibility allows data analysts to combine open data with other data sources to create new insights.
- Scalability: Open data is often available in large volumes, which can be used to train machine learning models and conduct other large-scale data analysis projects. This scalability is essential for addressing complex data challenges.
Here are some specific examples of how open data can be used for data analysis:
- Public health: Open data on disease outbreaks, vaccination rates, and other health indicators can be used to identify trends and patterns, track the spread of diseases, and develop targeted interventions.
- Economic development: Open data on economic indicators, such as GDP growth, unemployment rates, and poverty levels, can be used to assess the performance of economies and identify areas for improvement.
- Environmental protection: Open data on air quality, water quality, and other environmental indicators can be used to track environmental changes, identify pollution sources, and develop strategies for environmental protection.
- Social justice: Open data on crime rates, educational attainment, and other social indicators can be used to identify disparities and develop policies to address them.
Overall, open data is a powerful tool for data analysts that can be used to create credible databases, leverage shared data, and make better decisions.
Here are some additional tips for using open data for data analysis:
- Identify the right datasets: There are many different open datasets available, so it is important to identify the datasets that are most relevant to your research question.
- Clean and prepare the data: Once you have identified the relevant datasets, you will need to clean and prepare the data for analysis. This may involve removing duplicate records, correcting errors, and converting the data into a consistent format.
- Use appropriate analytical methods: Once the data is clean and prepared, you can use appropriate analytical methods to extract insights from the data.
- Share your findings: Once you have extracted insights from the data, it is important to share your findings with others. This can be done by publishing your findings in a journal article, giving a presentation, or creating a blog post.
By following these tips, you can use open data to conduct valuable data analysis projects that can make a positive impact on the world.
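The "clean and prepare" tip above can be sketched with Python’s built-in `csv` module. This is a minimal example under stated assumptions: the dataset, its column names (`city`, `report_date`, `air_quality_index`), and the `clean_rows` helper are invented for illustration, and skipping rows that cannot be parsed is only one possible choice. In a real project you might flag such rows for review instead.

```python
import csv
import io

# Hypothetical extract of an open dataset; the columns and values are invented.
RAW_CSV = """city,report_date,air_quality_index
Springfield,2023-01-05,42
springfield ,2023-01-05,42
Riverton,2023-01-05,n/a
Riverton,2023-01-06, 57
"""


def clean_rows(csv_text):
    """Normalize fields, skip rows with unusable readings, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Consistent format: trim whitespace and standardize capitalization.
        city = row["city"].strip().title()
        report_date = row["report_date"].strip()
        try:
            # Error handling: a reading that is not a whole number is skipped.
            reading = int(row["air_quality_index"].strip())
        except ValueError:
            continue
        # Duplicate removal: keep only the first copy of each normalized row.
        key = (city, report_date, reading)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(
            {"city": city, "report_date": report_date, "air_quality_index": reading}
        )
    return cleaned


cleaned = clean_rows(RAW_CSV)
for row in cleaned:
    print(row)
```

Normalizing before checking for duplicates matters here: the two Springfield rows only match once the stray space and lowercase spelling are fixed.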
There’s just something so liberating about being able to find information on any subject at all on the Internet. Can’t remember the 3rd line of your favorite childhood song, curious who had the most home runs in 1986, want to teach yourself sign language? Just pop open your laptop, type up some text and poof, you have what you need. Many groups think we should also have this level of access to data. There’s even a global movement that believes the openness of data can transform society and how decisions are made.
So far, we’ve talked a lot about the power of data and the importance of data ethics concerns including ownership, transaction transparency, consent, currency, and privacy. Now, let’s talk about openness. When referring to data, openness refers to free access, usage and sharing of data. Sometimes we refer to this as open data, but it doesn’t mean we ignore the other aspects of data ethics we covered. We should still be transparent, respect privacy, and make sure we have consent for data that’s owned by others. This just means we can access, use, and share that data if it meets these high standards.
For example, there are standards around availability and access. Open data must be available as a whole, preferably by downloading over the Internet in a convenient and modifiable form. The website data.gov is a great example. You can download science and research data for a wide range of industries in simple file formats like a spreadsheet. Another standard surrounds reuse and redistribution. Open data must be provided under terms that allow reuse and redistribution including the ability to use it with other datasets. And the last area is universal participation. Everyone must be able to use, reuse, and redistribute the data. There shouldn’t be any discrimination against fields, persons, or groups. No one can place restrictions on the data like making it only available for use in a specific industry.
Now let’s talk a little more about why open data is such a great thing and how it can help you as a data analyst. One of the biggest benefits of open data is that credible databases can be used more widely. More importantly, all of that good data can be leveraged, shared, and combined with other data. Just imagine the impact that would have on scientific collaboration, research advances, analytical capacity, and decision-making. For example, in human health, openness allows us to access and combine diverse data to detect diseases earlier and earlier. In government, you can help hold leaders accountable and provide better access to community services. The possibilities and benefits are almost endless.
But of course, every big idea has its challenges. A whole lot of resources are needed to make the technological shift to open data. Interoperability is key to open data’s success. Interoperability is the ability of data systems and services to openly connect and share data. For example, data interoperability is important for health care information systems where multiple organizations such as hospitals, clinics, pharmacies, and laboratories need to access and share data to ensure patients get the care that they need. This is why your doctor is able to send your prescription directly to your pharmacy to fill. They have compatible databases that allow them to share information. But this kind of interoperability requires a lot of cooperation. While there is serious potential in the open, timely, fair, and simple sharing of data, its future will depend on how effectively larger challenges are addressed. As a data analyst, I say the sooner the better.
Speaking of which, we’re going to talk more about open data and see its use in action in an upcoming video. Now that you’ve learned all about data ethics, you have some important principles to guide you on your data journey. Anytime you’re not sure of your data, remember what you’ve learned here. Happy Trails.
Reading: The open-data debate
Just like data privacy, open data is a widely debated topic in today’s world. Data analysts think a lot about open data, and as a future data analyst, you need to understand the basics to be successful in your new role.
What is open data?
In data analytics, open data is part of data ethics, which has to do with using data ethically. Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:
- Be available and accessible to the public as a complete dataset
- Be provided under terms that allow it to be reused and redistributed
- Allow universal participation so that anyone can use, reuse, and redistribute the data
Data can only be considered open when it meets all three of these standards.
The open data debate: What data should be publicly available?
One of the biggest benefits of open data is that credible databases can be used more widely. Basically, this means that all of that good data can be leveraged, shared, and combined with other data. This could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is important to think about the individuals being represented by the public, open data, too.
Third-party data is collected by an entity that doesn’t have a direct relationship with the data. You might remember learning about this type of data earlier. For example, third parties might collect information about visitors to a certain website. Doing this lets these third parties create audience profiles, which helps them better understand user behavior and target them with more effective advertising.
Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important to keep this data safe. PII can include a person’s address, credit card information, social security number, medical records, and more.
Everyone wants to keep personal information about themselves private. Because third-party data is readily available, it is important to balance the openness of data with the privacy of individuals.
Video: Andrew: Steps for ethical data use
As a data analyst, it is important to be aware of the ethical implications of your work. You should question your own motivations, consider the potential harms and risks of your work, and ensure that you are using the data responsibly. You should also be mindful of how you present your findings and how they will be used in decision-making. By taking a nuanced approach to your analysis and being cognizant of all the possible risks and harms, you can help to ensure that your work is used for good.
My name is Andrew. I am a Senior Developer Advocate on the ethical AI research group at Google.
As an analyst, there’s quite a few things you can do as you’re evaluating your dataset in order to ensure that you’re looking at it through the various ethical lenses. One of them is being able to self-reflect and understand what it is that you’re doing and the impact that it has. The best way to challenge that is to question who we are. We being, like, okay, we in this team are trying to build this because we think that that’s going to help improve this product or that’s going to help inform decisions about what we want to do next. Think about not just those that sit laterally next to you, but also think about those that are represented in this dataset and those that aren’t represented in this dataset, and then use that intuition to then continue to question the integrity, the quality, the representation that is present in that dataset.
And then also, think about the various harms and risks associated with the work that you’re doing. For example, if you think that you’ll benefit from keeping the dataset longer, you may want to also understand what’s the risk of holding onto this dataset? What’s the potential harm that could arise if you continue to look at the dataset and continue to store it and continue to retrieve this data? And going beyond that, also understanding what’s the consent process like. Are you informing those that you’re collecting data from how it’s going to be used? What’s the communication channel like?
Putting on the various ethical lenses, taking a more nuanced approach to your analysis, being cognizant of all the possible risks and harms that can arise when not just analyzing your dataset, but also presenting your dataset. How you portray the results, how they’re being used in the decision-making process, whether you are presenting this to management, or presenting this to executives, or presenting this to a larger audience. All of that matters in the responsible use of the dataset.
But as a data analyst, you stand in the intersection between the very people that will stand to benefit from the technology that’s being developed and those in your organization that are trying to make a more informed decision as to whether or not to move forward with the productionization of the technology. It may feel like there’s a lot of weight there, and there is, but it’s also very pivotal, and it speaks to the volume of the impact of your work.
Reading: Sites and resources for open data
Overview
Luckily for data analysts, there are lots of trustworthy sites and resources available for open data. It is important to remember that even reputable data needs to be constantly evaluated, but these websites are a useful starting point:
- U.S. government data site: Data.gov is one of the most comprehensive data sources in the U.S. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations.
- U.S. Census Bureau: This open data source offers demographic information from federal, state, and local governments, as well as from commercial entities in the U.S.
- Open Data Network: This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.
- Google Cloud Public Datasets: A selection of public datasets is available through the Google Cloud Public Dataset Program, already loaded into BigQuery.
- Dataset Search: Dataset Search is a search engine designed specifically for datasets; you can use it to search for specific datasets.
Practice Quiz: Hands-On Activity: Kaggle datasets
Practice Quiz: Test your knowledge on open data
What aspect of data ethics promotes the free access, usage, and sharing of data?
Openness
Openness is the aspect of data ethics that promotes the free access, usage, and sharing of data.
What are the main benefits of open data? Select all that apply.
Open data combines data from different fields of knowledge.
Open data makes good data more widely available.
The benefits of open data include making good data more widely available and combining data from different fields of knowledge.
Universal participation is a standard of open data. What are the key aspects of universal participation? Select all that apply.
No one can place restrictions on data to discriminate against a person or group.
Everyone must be able to use, re-use, and redistribute open data.
The key aspects of universal participation are that everyone must be able to use, reuse, and redistribute open data. Also, no one can place restrictions on data to discriminate against a person or group.
Weekly challenge
Reading: Glossary: Terms and definitions
Quiz: Weekly challenge 2
A clinic surveys a group of male and female patients about their experience with physical therapy. The survey does not include people with disabilities. Is the survey data biased?
Yes
Which type of bias is the tendency to always construe ambiguous situations in a positive or negative way?
Interpretation
A data analyst reviews a dataset. They conclude that the data is inaccurate and incomplete in some places. They also confirm that the data is biased. What type of data does this describe?
Unreliable data
In data ethics, consent gives an individual the right to know the answers to which of the following questions? Select all that apply.
How will my data be used?
Why is my data being collected?
How long will my data be stored?
Transaction transparency is a fundamental right for an individual who provides their data. Which of the following is included in this right? Select all that apply.
Knowing for how long the data will be used.
Understanding the algorithms to be used on the data.
Knowing the data-processing activities to be used on that data.
The right to inspect, update, or correct your own data is part of which aspect of data ethics?
Data privacy
Why would a company routinely use a data anonymizer when working with their users’ data?
To protect its users’ private and sensitive data by removing any identifying information
A government agency allows any business, nonprofit organization, or citizen to access the government’s databases and re-use or re-distribute the data. What type of data is this an example of?
Open data