
Module 2: Sparkling-clean data

Every data analyst wants clean data to work with when performing an analysis. In this part of the course, you’ll learn the difference between clean and dirty data. You’ll also explore data cleaning techniques using spreadsheets and other tools.

Learning Objectives

  • Differentiate between clean and dirty data
  • Explain the characteristics of dirty data
  • Describe data cleaning techniques with reference to identifying errors, redundancy, compatibility and continuous monitoring
  • Identify common pitfalls when cleaning data
  • Demonstrate an understanding of the use of spreadsheets to clean data

Data cleaning is a must


Video: Clean it up!

  • Cost of bad data: An estimated $3.1 trillion annually in the US alone, often caused by human error.
  • Dirty data: Incomplete, incorrect, irrelevant data leading to unreliable results.
  • Clean data: Complete, correct, relevant data leading to smooth projects and reliable analysis.
  • Importance of clean data: Enables accurate analysis, prevents errors, and saves time.
  • Upcoming lessons: Tools and skills for cleaning data, differences between clean and dirty data, cleaning methodologies, and common data issues.

Can you guess what inaccurate or
bad data costs businesses every year? Thousands of dollars? Millions? Billions? Well, according to IBM, the yearly cost of poor-quality data
is $3.1 trillion in the US alone. That’s a lot of zeros. Now, can you guess the number
one cause of poor-quality data? It’s not a new system implementation or
a computer technical glitch. The most common factor
is actually human error. Here’s a spreadsheet from a law office. It shows customers,
the legal services they bought, the service order number, how much
they paid, and the payment method. Dirty data can be the result of someone
typing in a piece of data incorrectly; inconsistent formatting; blank fields; or the same piece of
data being entered more than once, which creates duplicates. Dirty data is data that’s incomplete,
incorrect, or irrelevant to the problem
you’re trying to solve. When you work with dirty data, you can’t
be sure that your results are correct. In fact,
you can pretty much bet they won’t be. Earlier, you learned that data
integrity is critical to reliable data analytics results, and clean data
helps you achieve data integrity. Clean data is data that’s complete,
correct, and relevant to the problem
you’re trying to solve. When you work with clean data, you’ll find
that your projects go much more smoothly. I remember the first time I witnessed
firsthand how important clean data really is. I had just started using SQL, and
I thought it worked like magic. I could have the computer sum up millions
of numbers, saving me tons of time and effort. But I quickly discovered that only
works when the data is clean. If there was even one accidental letter in
a column that should only have numbers, the computer wouldn’t know what to do. So, it would throw an error, and
suddenly, I’m stuck. And there’s no way I could add up
millions of numbers by myself. So, I have to clean out
that data to make it work. The good news is that there’s
plenty of effective processes and tools to help you do that. Coming up, you’ll gain the skills and knowledge you need to make sure
the data you work with is always clean. Along the way, we’ll dig deeper into the
difference between clean and dirty data, and why clean data is so important. We’ll also talk about different
ways to clean your data and common problems to look for
during the process. Ready to start? Let’s do it.

Video: Why data cleaning is important

Importance of Clean Data:

  • Dirty data (incomplete, incorrect, irrelevant) leads to unreliable analysis and poor decisions.
  • Clean data (complete, correct, relevant) enables accurate analysis, informs effective decisions, and prevents problems.

Data Cleaning Analogy:

  • Brushing teeth: Regular cleaning prevents issues like cavities and gum disease. Similarly, regular data cleaning avoids costly errors and improves analysis.

Data Cleaning Process:

  • Examine data before analysis.
  • Identify and address common errors like:
    • Duplicates: Eliminate redundant entries.
    • Nulls: Decide how to handle missing values (filtering, imputation).
    • Inconsistency: Standardize formats and units.
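
For readers who also work in Python, here is a minimal pandas sketch of the three steps above; the DataFrame and column names are invented for illustration, and how you treat nulls (filtering versus imputation) depends on your analysis.

    import pandas as pd

    # Hypothetical user table; usernames, scores, and plans are made up
    df = pd.DataFrame({
        "username":     ["kai@example.com", "kai@example.com", "lee@example.com", "ana@example.com"],
        "satisfaction": [4, 4, None, 0],
        "plan":         ["basic", "basic", "Basic ", "premium"],
    })

    # Duplicates: eliminate redundant entries
    df = df.drop_duplicates()

    # Nulls: either filter them out (smaller sample size) ...
    answered = df.dropna(subset=["satisfaction"])
    # ... or keep them and report how many respondents skipped the question
    skipped = df["satisfaction"].isna().sum()

    # Inconsistency: standardize formats (stray spaces, mixed capitalization)
    df["plan"] = df["plan"].str.strip().str.lower()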

Data Sources:

  • Internal data (maintained by data engineers and warehousing specialists) is generally cleaner.
  • External data (surveys, web scraping) often requires more cleaning due to nulls, inconsistencies, and potential biases.

Next Steps:

  • Learn about common data errors and cleaning techniques.
  • Gain the skills to effectively clean both internal and external data.

Overall Message:

Data cleaning is a crucial skill for data analysts. Regular cleaning ensures accurate analysis, prevents problems, and ultimately promotes data-driven decision making.

Introduction:

Imagine trying to solve a puzzle with missing pieces or pieces that don’t fit together. That’s what data analysis feels like when you’re working with dirty data. Data cleaning, while often overlooked, is the fundamental step that sets the stage for reliable and meaningful analysis. In this tutorial, we’ll delve into the importance of data cleaning and equip you with the knowledge to approach your analysis with confidence.

The Cost of Dirty Data:

“Garbage in, garbage out” perfectly summarizes the impact of dirty data. Inaccurate or incomplete data leads to misleading conclusions, flawed decisions, and ultimately, wasted resources. Studies estimate the annual cost of poor-quality data in the US alone to be a staggering $3.1 trillion.

How Does Data Get Dirty?

Several factors contribute to dirty data:

  • Human Error: Typos, misinterpretations, and inconsistencies in data entry are common culprits.
  • Technical Glitches: Software bugs or hardware failures can corrupt data during collection or storage.
  • Data Integration: Combining data from multiple sources often leads to inconsistencies in formats, units, and definitions.
  • External Data: Publicly available data or information scraped from websites can be unreliable and require significant cleaning.

Consequences of Dirty Data:

The effects of dirty data can be far-reaching:

  • Misleading Analysis: Biased or inaccurate results lead to incorrect conclusions and ineffective decisions.
  • Financial Losses: Making decisions based on bad data can result in lost revenue, wasted resources, and even legal repercussions.
  • Reputational Damage: Publishing flawed analysis based on dirty data can erode trust and damage an organization’s reputation.

Benefits of Clean Data:

Investing in data cleaning provides numerous benefits:

  • Improved Accuracy: Clean data leads to more reliable and trustworthy analysis, resulting in sounder decisions.
  • Enhanced Efficiency: By eliminating irrelevant or redundant data, you streamline your analysis process and save valuable time.
  • Boosted Productivity: Clean data makes it easier to identify patterns, trends, and relationships, leading to deeper insights and improved problem-solving.
  • Increased Confidence: Knowing your data is clean allows you to confidently share your findings and make informed recommendations.

Data Cleaning Techniques:

Various techniques are employed to clean data, depending on the specific issues encountered:

  • Identifying and correcting errors: This involves finding and fixing typos, inconsistencies, and missing values.
  • Standardizing formats: Ensure data formats are consistent across columns and rows, including units, dates, and currencies.
  • Handling duplicates: Identify and remove redundant entries to avoid overcounting or skewed results.
  • Dealing with missing values: Decide how to handle missing data, such as filtering, imputation, or flagging.
  • Validating data: Check for outliers, inconsistencies, and potential biases to ensure data quality.
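
As a rough illustration of standardizing formats and validating values, here is a short pandas sketch; the orders table and column names are hypothetical, and the IQR rule shown is just one common way to flag outliers.

    import pandas as pd

    # Hypothetical orders with text dates and currency strings
    orders = pd.DataFrame({
        "order_date": ["2020-07-04", "2020-07-05", "2020-07-06"],
        "amount":     ["$1,200.50", "980", "$75.00"],
    })

    # Standardize formats: strip currency symbols and convert text to numbers
    orders["amount"] = (orders["amount"]
                        .str.replace(r"[$,]", "", regex=True)
                        .astype(float))

    # Standardize dates: convert text to a real date type
    orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")

    # Validate: flag values far outside the typical range (IQR rule)
    q1, q3 = orders["amount"].quantile([0.25, 0.75])
    outliers = orders[orders["amount"] > q3 + 1.5 * (q3 - q1)]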

Conclusion:

Data cleaning is an essential, yet often underestimated, step in the data analysis process. By investing time and effort into cleaning your data, you lay the foundation for successful analysis, informed decision-making, and ultimately, organizational success. Remember, clean data is the fuel that drives your analysis engine, and with it, you can explore, discover, and unlock the true potential of your data.

Next Steps:

  • Explore specific data cleaning techniques for common data errors.
  • Practice data cleaning using different tools and software.
  • Share your experiences and challenges with other data enthusiasts.

By embracing data cleaning as a crucial part of your analytical workflow, you can ensure that your results are accurate, insightful, and impactful.

Clean data is incredibly important for effective analysis. If a piece of data
is entered into a spreadsheet or
database incorrectly, or if it’s repeated, or if a field is left blank, or if data formats
are inconsistent, the result is dirty data. Small mistakes can lead to big consequences in the long run. I’ll be completely
honest with you, data cleaning is like
brushing your teeth. It’s something you
should do and do properly because otherwise it
can cause serious problems. For teeth, that might be
cavities or gum disease. For data, that might be costing your company money,
or an angry boss. But here’s the good news. If you keep brushing twice a day, every day, it becomes a habit. Soon, you don’t even
have to think about it. It’s the same with data. Trust me, it will
make you look great when you take the time to
clean up that dirty data. As a quick refresher, dirty data is incomplete, incorrect, or irrelevant to the problem you’re
trying to solve. It can’t be used in a
meaningful way, which makes analysis very
difficult, if not impossible. On the other hand,
clean data is complete, correct, and relevant to the problem you’re
trying to solve. This allows you to
understand and analyze information and identify
important patterns, connect related information,
and draw useful conclusions. Then you can apply what you learn to make effective decisions. In some cases, you won’t have to do a lot of work
to clean data. For example, when you use internal data that’s
been verified and cared for by your company’s data engineers
and data warehouse team, it’s more likely to be clean. Let’s talk about
some people you’ll work with as a data analyst. Data engineers
transform data into a useful format for analysis and give it a
reliable infrastructure. This means they
develop, maintain, and test databases, data
processors and related systems. Data warehousing specialists
develop processes and procedures to effectively
store and organize data. They make sure that
data is available, secure, and backed
up to prevent loss. When you become a data analyst, you can learn a lot by working with the
person who maintains your databases to learn
about their systems. If data passes
through the hands of a data engineer or a data
warehousing specialist first, you know you’re off to a
good start on your project. There’s a lot of great
career opportunities as a data engineer or a data
warehousing specialist. If this kind of work
sounds interesting to you, maybe your career
path will involve helping organizations
save lots of time, effort, and money by making sure their data is
sparkling clean. But even if you go in a different direction with
your data analytics career and have the advantage
of working with data engineers and
warehousing specialists, you’re still likely to have
to clean your own data. It’s important to remember: no dataset is perfect. It’s always a good
idea to examine and clean data before
beginning analysis. Here’s an example. Let’s say you’re working on a
project where you need to figure out
how many people use your company’s
software program. You have a spreadsheet that
was created internally and verified by a data engineer and a data warehousing
specialist. Check out the column
labeled “Username.” It might seem logical that
you can just scroll down and count the rows to figure out
how many users you have. But that won’t work
because one person sometimes has more
than one username. Maybe they registered from
different email addresses, or maybe they have a work
and personal account. In situations like this, you would need to
clean the data by eliminating any rows
that are duplicates. Once you’ve done
that, there won’t be any more duplicate entries. Then your spreadsheet is
ready to be put to work. So far we’ve discussed
working with internal data. But data cleaning becomes even more important when working with external data, especially if it comes from multiple sources. Let’s say the software company
from our example surveyed its customers to learn how satisfied they are with
its software product. But when you review
the survey data, you find that you
have several nulls. A null is an indication that a value does not
exist in a data set. Note that it’s not
the same as a zero. In the case of a survey, a null would mean the customers
skipped that question. A zero would mean they provided
zero as their response. To do your analysis, you would first need
to clean this data. Step one would be to decide
what to do with those nulls. You could either
filter them out and communicate that you now
have a smaller sample size, or you can keep them in
and learn from the fact that the customers did
not provide responses. There’s lots of reasons why
this could have happened. Maybe your survey questions weren’t written as
well as they could be. Maybe they were
confusing or biased, something we learned
about earlier. We’ve touched on the basics of cleaning internal
and external data, but there’s lots more to come. Soon, we’ll learn about
the common errors to be aware of to ensure
your data is complete, correct, and relevant.
See you soon!
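
To make the null-versus-zero distinction from the transcript concrete, here is a small pandas sketch; the survey table is made up, and both options described above (filtering the nulls out or keeping them) are shown.

    import pandas as pd

    # Hypothetical survey responses; None marks a skipped question, 0 is a real answer
    survey = pd.DataFrame({
        "customer_id":  [101, 102, 103, 104],
        "satisfaction": [5, 0, None, 3],
    })

    # A null is not a zero: count each separately
    skipped = survey["satisfaction"].isna().sum()     # customers who skipped
    zeros   = (survey["satisfaction"] == 0).sum()     # customers who answered 0

    # Option 1: filter out the nulls and note the smaller sample size
    answered_only = survey.dropna(subset=["satisfaction"])

    # Option 2: keep the nulls and treat non-response as its own signal
    response_rate = 1 - skipped / len(survey)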

Video: Angie: Why I love cleaning data

  • Angie loves data cleaning because it reveals the true nature of the data, its quirks, flaws, and mysteries.
  • She shares an example of a data mystery where a million dollars worth of chicken sandwiches seemed to be purchased.
  • Cleaning revealed a technical error where transactions were squared and multiplied.
  • The process took three days and felt like solving a detective case.
  • Angie emphasizes the joy of cleaning data and gaining a deeper understanding of its secrets.

Key Message:

Data cleaning is not just a technical task but also a fascinating journey of discovery and insight, turning Angie into a data detective who unlocks the mysteries within.

I am Angie. I’m a program manager of engineering at Google. I truly believe
that cleaning data is the heart and soul of data. It’s how you get
to know your data: its quirks, its flaws,
its mysteries. I love a good mystery. I remember one time I found
somebody had purchased, I think it was one
million dollars worth of chicken sandwiches
in one transaction. This mystery drove me nuts. I had all these questions. Could this have really happened? Or maybe it was a really
big birthday party? How did they make
a million dollars worth of chicken sandwiches? I was cleaning my data and trying to figure out
where did it go wrong. We ended up finding out
that we’d been squaring and multiplying all of
our transactions for a very specific case. It took us about three
days to figure this out. I will never forget the
moment when it was like, aha! We got to the bottom of it. The result is our data
was cleaned, and we had this great dataset that
we could use for analysis. But what I loved was
just the mystery of it and getting to know all these weird intricacies
about my dataset. It felt like a superpower almost, like I was a detective, and I had gone in
there and I had really solved something. I
love cleaning data!

Reading: What is dirty data?

Video: Recognize and remedy dirty data

This video focuses on identifying and understanding common types of dirty data encountered in data analysis, using a law office spreadsheet as an example. Here’s a summary of the key points:

Types of Dirty Data:

  • Typos and inconsistencies: Misspellings, punctuation errors, mixed-up letters, and inconsistent formats (e.g., confusing currency and percentages).
  • Mismatched units: Data entered in different units (e.g., US dollars vs. euros).
  • Inconsistent formatting: Data formats not aligning with intended representation (e.g., currency shown as percentage).
  • Nulls: Empty fields missing information requiring further research.
  • Duplicates: Repeated entries of the same data (e.g., duplicate appointments).
  • Incorrect labeling: Mislabeled data causing misinterpretation (e.g., wrong animal label in image recognition).
  • Uneven field lengths: Inconsistent character allowance in fields leading to potential errors (e.g., variable year lengths).

Emphasis on Data Cleaning:

  • The video underscores the importance of data cleaning in ensuring accurate analysis.
  • Different techniques for cleaning each type of dirty data will be covered later.
  • Proper data validation practices can help prevent dirty data entry.

Overall, this video sets the stage for understanding and addressing various issues associated with dirty data, paving the way for further exploration of data cleaning techniques.

Dirty data is the bane of any data analyst’s existence. It lurks in spreadsheets, databases, and APIs, hindering our efforts to extract meaningful insights. But fear not! This tutorial equips you with the tools to recognize and remedy these pesky inaccuracies.

Part 1: Identifying Dirty Data

  1. Typos and Inconsistencies: Keep an eye out for misspellings, punctuation errors, and mixed-up letters. Inconsistent capitalization, formatting, and units (e.g., USD vs. EUR) are also red flags.
  2. Missing Values: Nulls or empty fields indicate missing information. Investigate further to understand why data is missing and how to fill it in responsibly.
  3. Duplicates: Identify and eliminate redundant entries. Duplicates can skew your analysis and lead to inaccurate conclusions.
  4. Outliers: Data points that significantly deviate from the norm might be errors or represent genuine outliers. Analyze them cautiously and decide whether to include them in your analysis.
  5. Inconsistent Labeling: Mislabeled data can throw off your entire analysis. Verify labels are accurate and consistent across your dataset.
  6. Formatting Inconsistency: Data types (text, numbers, dates) should be consistent within fields and columns. Inconsistencies can lead to inaccurate calculations and interpretations.

Part 2: Remediating Dirty Data

  1. Standardization: Standardize capitalization, punctuation, and formatting for consistency. Apply data type validations to prevent errors during entry.
  2. Imputation: For missing values, consider imputation techniques like mean, median, or mode filling. Always document the methods used for transparency.
  3. Deductive Reasoning: In some cases, missing values can be deduced from context clues or available information. Use logic and knowledge of the data domain to fill in the gaps responsibly.
  4. Data Profiling: Tools like data profiling analyze statistical distributions, identify outliers, and highlight inconsistencies. Use these insights to clean and refine your data.
  5. Normalization: Normalization ensures data conforms to a standard format, facilitating comparisons and analysis across different sources.
  6. Validation and Verification: Regularly validate and verify your data cleaning efforts. Run test queries and analyses to ensure accuracy and consistency.
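
As an example of imputation and documentation, here is a brief pandas sketch; the price values are invented, and median filling is only one of the options mentioned above.

    import pandas as pd

    # Hypothetical price column with missing values
    prices = pd.Series([9.99, 12.50, None, 11.25, None], name="price")

    # Impute with the median (robust to outliers); mean or mode are alternatives
    filled = prices.fillna(prices.median())

    # Document the transformation so others can see what was changed
    cleaning_log = {"column": "price", "method": "median imputation",
                    "rows_imputed": int(prices.isna().sum())}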

Remember:

  • No single solution fits all. Choose the appropriate remediation technique based on the type and severity of the dirty data.
  • Document your cleaning process carefully. Transparency is key in understanding how your data was transformed and ensuring its reliability.
  • Data cleaning is an ongoing process. As new data comes in, be vigilant in identifying and addressing any emerging issues.

Bonus Tools:

  • Utilize specialized data cleaning tools and libraries like Pandas (Python) or OpenRefine to automate tasks and streamline the process.
  • Seek guidance from experienced data analysts and learn from their techniques.

By mastering these skills, you’ll be well on your way to conquering dirty data and unlocking the true potential of your analysis. Keep practicing, stay curious, and remember: clean data is happy data!

Fill in the blank: Data _____ is a cleaning feature to check the accuracy and quality of data before adding or importing it.

validation

Data validation is a cleaning feature to check the accuracy and quality of data before adding or importing it.

Hey, there. In this video, we’ll focus on common issues
associated with dirty data. These include spelling and other text errors;
inconsistent labels, formats, and field lengths;
missing data; and duplicates. This will help you
recognize problems quicker and give you
the information you need to fix them when you encounter something similar
during your own analysis. This is incredibly important
in data analytics. Let’s go back to our
law office spreadsheet. As a quick refresher, we’ll start by checking out the different types of
dirty data it shows. Sometimes, someone might key in a piece of
data incorrectly. Other times, they might not keep data formats consistent. It’s also common to
leave a field blank. That’s also called a null, which we learned about earlier. If someone adds the same
piece of data more than once, that creates a duplicate. Let’s break that down. Then we’ll learn about
a few other types of dirty data and strategies
for cleaning it. Misspellings, spelling
variations, mixed up letters, inconsistent punctuation,
and typos in general, happen when someone types in
a piece of data incorrectly. As a data analyst, you’ll also deal with
different currencies. For example, one
dataset could be in US dollars and another in euros, and you don’t want to
get them mixed up. We want to find these types of errors and fix them like this. You’ll learn more
about this soon. Clean data depends largely on the data integrity rules that
an organization follows, such as spelling and
punctuation guidelines. For example, a beverage company might ask everyone working in its database to enter data about volume in fluid ounces
instead of cups. It’s great when an organization has rules like this in place. It really helps minimize the amount of data
cleaning required, but it can’t eliminate
it completely. Like we discussed earlier, there’s always the
possibility of human error. The next type of dirty data our spreadsheet shows is
inconsistent formatting. In this example, something that should
be formatted as currency is shown
as a percentage. Until this error is fixed, like this, the law office will have no idea how much money this customer paid
for its services. We’ll learn about
different ways to solve this and many other
problems soon. We discussed nulls previously, but as a reminder, nulls are empty fields. This kind of dirty data requires a little more work
than just fixing a spelling error or
changing a format. In this example, the data analysts would need to research which customer had a consultation on
July 4th, 2020. Then when they find the
correct information, they’d have to add it
to the spreadsheet. Another common type of
dirty data is a duplicate. Maybe two different people added this appointment on August 13th, not realizing that someone
else had already done it or maybe the person
entering the data hit copy and paste by accident. Whatever the reason, it’s
the data analyst job to identify this error and correct it by deleting one
of the duplicates. Now, let’s continue on to some other types of dirty data. The first has to
do with labeling. To understand labeling, imagine trying to get a
computer to correctly identify panda bears among images of all different
kinds of animals. You need to show the
computer thousands of images of panda bears. They’re all labeled
as panda bears. Any incorrectly labeled picture, like the one here
that’s labeled just “bear,” will cause a problem. The next type of
dirty data is having an inconsistent field length. You learned earlier that a
field is a single piece of information from a row or
column of a spreadsheet. Field length is a
tool for determining how many characters can
be keyed into a field. Assigning a certain
length to the fields in your spreadsheet is a
great way to avoid errors. For instance, if you have a column for
someone’s birth year, you know the field
length is four because all years are
four digits long. Some spreadsheet
applications have a simple way to
specify field lengths and make sure users
can only enter a certain number of
characters into a field. This is part of data validation. Data validation is
a tool for checking the accuracy and quality of data before adding
or importing it. Data validation is a
form of data cleansing, which you’ll learn
more about soon. But first, you’ll get familiar with more techniques
for cleaning data. This is a very important part
of the data analyst job. I look forward to sharing these data cleaning
strategies with you.
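
Spreadsheet data validation itself is configured in the application, but the same field-length checks can be scripted. Here is a hedged pandas sketch using the zip code and birth year examples from the transcript; the column values are made up.

    import pandas as pd

    # Hypothetical columns; real validation rules would come from your organization
    df = pd.DataFrame({
        "zip_code":   ["30318", "3031", "90210", "abcde"],
        "birth_year": ["1985", "86", "2001", "1999"],
    })

    # Field length check: zip codes should be exactly five digits
    bad_zips = df[~df["zip_code"].str.fullmatch(r"\d{5}")]

    # Birth years should be exactly four digits
    bad_years = df[~df["birth_year"].str.fullmatch(r"\d{4}")]

    print(bad_zips)    # rows to research and correct before analysis
    print(bad_years)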

Practice Quiz: Test your knowledge on clean versus dirty data

Describe the difference between a null and a zero in a dataset.

What are the most common processes and procedures handled by data engineers? Select all that apply.

What are the most common processes and procedures handled by data warehousing specialists? Select all that apply.

A data analyst is cleaning a dataset. They want to confirm that users entered five-digit zip codes correctly by checking the data in a certain spreadsheet column. What would be most helpful as the next step?

Begin cleaning data


Video: Data-cleaning tools and techniques

Cleaning Dirty Data in Spreadsheets: Key Takeaways

Main Idea: This video focuses on essential techniques for cleaning messy data in spreadsheets to ensure data integrity and facilitate reliable analysis.

Cleaning Techniques:

  • Remove Unwanted Data:
    • Duplicates: Can skew results and require removal, either manually or through built-in tools.
    • Irrelevant Data: Information unrelated to the current analysis should be excluded.
  • Clean Text:
    • Extra Spaces & Blanks: Eliminate unnecessary spaces to avoid sorting/filtering issues.
    • Typos & Formatting: Fix misspellings, capitalization inconsistencies, and punctuation errors. Tools like spellcheck and autocorrect can help.
  • Formatting Consistency:
    • Apply a consistent format (e.g., font style, alignment) for easier data visualization and decision-making. Use “clear formats” tools for quick cleanup.

Benefits of Data Cleaning:

  • Improved data quality
  • More accurate analysis and decision-making
  • Increased efficiency and productivity

Next Steps: The next video will cover cleaning data from multiple sources.

Remember: Cleaning data is crucial for obtaining reliable insights and making informed decisions based on your data.

Data Cleaning Tools and Techniques: A Practical Guide

Introduction:

Dirty data – it lurks in every spreadsheet, database, and API. But fear not, data analysts! This tutorial equips you with the tools and techniques to transform messy data into sparkling insights.

Part 1: Identifying the Enemy:

  1. Missing Values: Are your data points riddled with gaps? Learn how to impute missing values, estimate them, or even gracefully handle their absence.
  2. Inconsistent Formatting: Dates in M/D/Y, D/M/Y, or Y-M-D? Numbers with commas or periods? Standardize your data formats for seamless analysis.
  3. Duplicates: Double the data, double the trouble! Use built-in functions or advanced algorithms to hunt down and eliminate duplicates.
  4. Outliers: Do a few data points skew your results? Understand outlier detection techniques and decide whether to keep, transform, or remove them.
  5. Textual Mess: Typos, inconsistent capitalization, and extra spaces can wreak havoc. Leverage spellcheck, regular expressions, and text cleaning functions to wrangle unruly text.

Part 2: Cleaning Up Your Act:

  1. Manual Cleanup: For small datasets, sometimes the best tool is your own eagle eye. Spot inconsistencies, fix typos, and apply formatting rules with a focused approach.
  2. Built-in Tools: Most spreadsheets and data analysis software offer cleaning functionalities like duplicate removal, missing value imputation, and conditional formatting. Explore these readily available tools for efficient tidying.
  3. Scripting Solutions: Advanced users can leverage Python libraries like pandas and NumPy to write custom scripts for complex data cleaning tasks. Automate repetitive processes and unlock the power of code.
  4. Machine Learning Magic: For the data science enthusiasts, machine learning algorithms can identify subtle patterns and anomalies in your data, suggesting intelligent cleaning strategies.
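
As a taste of the “scripting solutions” idea above, here is a small pandas sketch that cleans messy text; the names are invented, and the exact chain of string operations you need will vary with your data.

    import pandas as pd

    # Hypothetical member names with typos, stray spaces, and mixed capitalization
    names = pd.Series(["  alice  SMITH", "Bob   jones ", "CAROL de la CRUZ"])

    cleaned = (names
               .str.strip()                           # drop leading/trailing spaces
               .str.replace(r"\s+", " ", regex=True)  # collapse repeated spaces
               .str.title())                          # consistent capitalization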

Part 3: Tools of the Trade:

  • Spreadsheets: Excel, Google Sheets, LibreOffice Calc – your friendly neighborhood data wranglers, equipped with basic but effective cleaning tools.
  • Data Analysis Software: Tools like Power BI, Tableau, and R provide powerful visual analysis tools alongside data cleaning functionalities.
  • Statistical Packages: SPSS and SAS offer advanced statistical analysis along with data cleaning capabilities for large datasets.
  • Open-Source Libraries: Python’s pandas and NumPy libraries, R’s dplyr and tidyverse packages, and open-source tools like OpenRefine are powerful data wrangling and cleaning champions.

Remember:

  • Data cleaning is an iterative process. Be prepared to revisit and refine your cleaning as you delve deeper into your data.
  • Document your cleaning steps! Transparency is key for reproducibility and sharing insights.
  • Embrace the power of automation. Scripts and machine learning can save you time and effort in the long run.

Conclusion:

With the right tools and techniques, dirty data becomes your playground for discovery. Go forth, conquer inconsistencies, and unleash the power of clean data!

Bonus:

  • Check out online courses and tutorials for specific data cleaning challenges.
  • Practice your skills on publicly available datasets.
  • Share your data cleaning tips and tricks with the data community!

I hope this tutorial serves as a springboard for your data cleaning journey. Now go forth and conquer messy data with confidence!

Hi. Now that you’re familiar with some
of the most common types of dirty data, it’s time to clean them up. As you’ve learned, clean data is
essential to data integrity and reliable solutions and decisions. The good news is that spreadsheets have
all kinds of tools you can use to get your data ready for analysis. The techniques for data cleaning will be
different depending on the specific data set you’re working with. So we won’t cover everything you might
run into, but this will give you a great starting point for fixing the types of
dirty data analysts find most often. Think of everything that’s coming up as
a teaser trailer of data cleaning tools. I’m going to give you a basic overview
of some common tools and techniques, and then we’ll practice them again later on. Here, we’ll discuss how to remove
unwanted data, clean up text to remove extra spaces and blanks, fix typos,
and make formatting consistent. However, before removing unwanted data, it’s always a good practice to
make a copy of the data set. That way, if you remove something that
you end up needing in the future, you can easily access it and
put it back in the data set. Once that’s done, then you can move on
to getting rid of the duplicates or data that isn’t relevant to
the problem you’re trying to solve. Typically, duplicates appear when you’re
combining data sets from more than one source or using data from multiple
departments within the same business. You’ve already learned
a bit about duplicates, but let’s practice removing them once
more now using this spreadsheet, which lists members of a professional
logistics association. Duplicates can be a big problem for
data analysts. So it’s really important
that you can find and remove them before any analysis starts. Here’s an example of
what I’m talking about. Let’s say this association has
duplicates of one person’s $500 membership in its database. When the data is summarized, the analyst would think there was
$1,000 being paid by this member and would make decisions
based on that incorrect data. But in reality,
this member only paid $500. These problems can be fixed manually, but
most spreadsheet applications also offer lots of tools to help you find and
remove duplicates. Now, irrelevant data, which is data that
doesn’t fit the specific problem that you’re trying to solve,
also needs to be removed. Going back to our association membership
list example, let’s say a data analyst was working on a project that
focused only on current members. They wouldn’t want to include information
on people who are no longer members, or who never joined in the first place. Removing irrelevant data takes a little
more time and effort because you have to figure out the difference between
the data you need and the data you don’t. But believe me, making those decisions will save
you a ton of effort down the road. The next step is removing extra
spaces and blanks. Extra spaces can cause unexpected
results when you sort, filter, or search through your data. And because these characters are easy to
miss, they can lead to unexpected and confusing results. For example, if there’s an extra space
in a member ID number, when you sort the column from lowest to
highest, this row will be out of place. To remove these unwanted spaces or
blank cells, you can delete them yourself. Or again, you can rely
on your spreadsheets, which offer lots of great functions for
removing spaces or blanks automatically. The next data cleaning step
involves fixing misspellings, inconsistent capitalization,
incorrect punctuation, and other typos. These types of errors can
lead to some big problems. Let’s say you have a database of emails
that you use to keep in touch with your customers. If some emails have misspellings,
a period in the wrong place, or any other kind of typo, not only do you
run the risk of sending an email to the wrong people, you also run
the risk of spamming random people. Think about our association
membership example again. Misspellings might cause the data
analyst to miscount the number of professional members if they
sorted this membership type and then counted the number of rows. Like the other problems
you’ve come across, you can also fix these problems manually. Or you can use spreadsheet tools,
such as spellcheck, autocorrect, and conditional formatting
to make your life easier. There’s also easy ways to convert text
to lowercase, uppercase, or proper case, which is one of the things
we’ll check out again later. All right, we’re getting there. The next step is removing formatting. This is particularly important when you
get data from lots of different sources. Every database has its own formatting, which can cause the data
to seem inconsistent. Creating a clean and consistent visual
appearance for your spreadsheets will help make it a valuable tool for you and
your team when making key decisions. Most spreadsheet applications
also have a “clear formats” tool, which is a great time saver. Cleaning data is an essential step in
increasing the quality of your data. Now you know lots of
different ways to do that. In the next video, you’ll take
that knowledge even further and learn how to clean up data that’s
come from more than one source.
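
Here is a tiny pandas sketch of the duplicate-membership example from the transcript, showing how one copied row inflates a total; the member IDs and dues amounts are made up.

    import pandas as pd

    # Hypothetical membership dues; the second row is an accidental duplicate
    dues = pd.DataFrame({
        "member_id": ["M-001", "M-001", "M-002"],
        "dues_paid": [500, 500, 350],
    })

    print(dues["dues_paid"].sum())       # 1350 -- overstates what members paid
    deduped = dues.drop_duplicates()
    print(deduped["dues_paid"].sum())    # 850  -- the correct total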

Video: Cleaning data from multiple sources

Merging Multiple Datasets: Challenges and Strategies

Main Idea:

This video focuses on the challenges and strategies of cleaning and merging data from multiple sources, commonly encountered by data analysts.

Merging Challenges:

  • Inconsistency: Different organizations use different formats, terminology, and structures, leading to misalignment and duplication in merged data.
  • Identifying Duplicates: Same people might be present in both datasets with different identifying numbers or formats (e.g., email vs. member ID).
  • Reconciling Differences: Varying member types and categories need careful comparison and standardization to avoid confusion.

Merging Strategies:

  • Compatibility Check: Ensure datasets share relevant data points and have similar formats before attempting to merge.
  • Data Cleaning: Cleanliness of individual datasets is crucial before merging to avoid propagating errors and contradictions.
  • Standardization: Establish consistent formats, terminology, and identifiers to streamline merging and analysis.
  • Tool Selection: Choose appropriate tools like spreadsheets, SQL queries, or programming languages (e.g., R) based on data complexity and desired cleaning/merging methods.

Benefits of Merging:

  • Gain comprehensive insights by combining related data from different sources.
  • Uncover broader patterns and trends not evident in individual datasets.
  • Improve decision-making based on richer and more diverse information.

Conclusion:

Merging data from multiple sources can be challenging due to inconsistencies, but careful planning, cleaning, and standardization techniques can pave the way for valuable insights and informed decision-making.

Remember:

  • Asking key questions about data availability, cleanliness, and compatibility is crucial before merging.
  • Choose the right tools and techniques based on the specific task and dataset complexity.
  • Practice merging on smaller datasets to build skills and confidence before tackling larger projects.

Tutorial: Cleaning Data from Multiple Sources in Data Analysis

Introduction

In this tutorial, we’ll explore the essential steps and techniques involved in cleaning and preparing data from multiple sources for effective analysis.

Key Challenges

  • Inconsistencies: Different formats, naming conventions, and data structures across datasets.
  • Duplicate Data: Redundant information due to overlapping sources.
  • Missing Values: Inconsistent handling of missing data in different datasets.
  • Data Quality Issues: Errors, typos, and inconsistencies within individual datasets.

Steps for Cleaning Data from Multiple Sources

  1. Assess Compatibility:
    • Identify common fields or variables that can be used for merging.
    • Check for compatibility in data types and formats.
    • Understand the structure and relationships within each dataset.
  2. Clean Individual Datasets:
    • Address missing values (imputation, deletion, marking).
    • Correct errors and inconsistencies (typos, formatting, outliers).
    • Standardize formats (date/time, capitalization, units of measurement).
  3. Merge Datasets:
    • Choose a merging method (inner, outer, left, right) based on analysis goals.
    • Use appropriate tools (spreadsheets, SQL, programming languages like Python or R).
    • Create a unified dataset with consistent structure and formatting.
  4. Identify and Handle Duplicates:
    • Use unique identifiers (if available) or combinations of variables to detect duplicates.
    • Decide whether to remove duplicates or retain them with additional information.
  5. Reconcile Differences:
    • Map different terminology or categories to a common standard.
    • Address inconsistencies in data values or coding schemes.
    • Validate merged data to ensure accuracy and coherence.
  6. Document the Cleaning Process:
    • Record cleaning steps, decisions, and transformations for reproducibility and transparency.
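
To illustrate step 3 (choosing a merging method), here is a minimal pandas sketch; the two membership tables and their column names are hypothetical, and which join type is right depends on your analysis goals.

    import pandas as pd

    # Hypothetical membership lists from two merging associations
    ila = pd.DataFrame({"email": ["a@x.com", "b@x.com"],
                        "member_type": ["Professional", "Student Associate"]})
    gla = pd.DataFrame({"email": ["b@x.com", "c@x.com"],
                        "member_type": ["Professional", "Young Professional"]})

    # Inner join keeps only people present in both lists;
    # outer join keeps everyone from either list
    both     = ila.merge(gla, on="email", how="inner", suffixes=("_ila", "_gla"))
    everyone = ila.merge(gla, on="email", how="outer", suffixes=("_ila", "_gla"))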

Tools for Data Cleaning

  • Spreadsheets: Excel, Google Sheets, LibreOffice Calc (for basic cleaning tasks).
  • Programming Languages: Python (pandas, NumPy), R (dplyr, tidyverse), SQL (for complex cleaning and merging).
  • Visualization Tools: Tableau, Power BI, R (ggplot2) (for visual inspection and cleaning).
  • Data Cleaning Software: OpenRefine (for specialized data cleaning tasks).

Best Practices

  • Start with clean individual datasets before merging.
  • Understand the relationships between datasets and merging goals.
  • Define a clear data cleaning plan and document the process.
  • Use appropriate tools based on dataset size and complexity.
  • Validate merged data to ensure accuracy and completeness.

Conclusion

Cleaning data from multiple sources is a crucial step in data analysis. By following these guidelines and choosing appropriate tools, you can create reliable and meaningful insights from diverse data sources.

Welcome back. So far
you’ve learned a lot about dirty data and how to clean up the most common
errors in a dataset. Now we’re going to take
that a step further and talk about cleaning
up multiple datasets. Cleaning data that comes from two or more sources is very
common for data analysts, but it does come with some
interesting challenges. A good example is a merger, which is an agreement that unites two organizations into
a single new one. In the logistics field, there’s been lots of
big changes recently, mostly because of
the e-commerce boom. With so many people
shopping online, it makes sense that the
companies responsible for delivering those
products to their homes are in the middle
of a big shake-up. When big things happen
in an industry, it’s common for two
organizations to team up and become
stronger through a merger. Let’s talk about how that will affect our logistics association. As a quick reminder, this spreadsheet lists
association member ID numbers, first and last names, addresses, how much each member
pays in dues, when the membership expires, and the membership types. Now, let’s think about
what would happen if the International Logistics
Association decided to get together with the Global
Logistics Association in order to help their members handle the incredible demands
of e-commerce. First, all the data from each organization
would need to be combined using data merging. Data merging is the
process of combining two or more datasets
into a single dataset. This presents a unique
challenge because when two totally different
datasets are combined, the information is almost guaranteed to be
inconsistent and misaligned. For example, the Global Logistics
Association’s spreadsheet has a separate column for
a person’s suite, apartment, or unit number, but the International
Logistics Association combines that information
with their street address. This needs to be
corrected to make the number of address
columns consistent. Next, check out how the Global Logistics Association uses people’s email
addresses as their member ID, while the International
Logistics Association uses numbers. This is a big problem because people in a
certain industry, such as logistics, typically join multiple professional
associations. There’s a very good chance
that these datasets include membership information on
the exact same person, just in different ways. It’s super important to
remove those duplicates. Also, the Global
Logistics Association has many more member types
than the other organization. On top of that, it uses a term, “Young Professional” instead
of “Student Associate.” But both describe members who are still in school or just
starting their careers. If you were merging
these two datasets, you’d need to work with your
team to fix the fact that the two associations describe memberships
very differently. Now you understand
why the merging of organizations also requires
the merging of data, and that can be tricky. But there’s lots of
other reasons why data analysts merge datasets. For example, in one
of my past jobs, I merged a lot of data
from multiple sources to get insights about our
customers’ purchases. The kinds of insights
I gained helped me identify customer
buying patterns. When merging datasets, I always begin by asking myself
some key questions to help me avoid redundancy and to confirm that the
datasets are compatible. In data analytics,
compatibility describes how well two or more datasets
are able to work together. The first question
I would ask is, do I have all the data I need? To gather customer
purchase insights, I wanted to make sure I
had data on customers, their purchases, and
where they shopped. Next I would ask, does the data I need exist
within these datasets? As you learned earlier
in this program, this involves considering the entire dataset analytically. Looking through the data
before I start using it lets me get a feel
for what it’s all about, what the schema looks like, if it’s relevant to my
customer purchase insights, and if it’s clean data. That brings me to
the next question. Do the datasets need to be cleaned, or are they
ready for me to use? Because I’m working with
more than one source, I will also ask myself, are the datasets cleaned
to the same standard? For example, what fields
are regularly repeated? How are missing values handled? How recently was the data updated? Finding the answers to these questions and understanding if I need to fix any problems at the start of a project is a very important step
in data merging. In both of the examples
we explored here, data analysts could use either the spreadsheet tools
or SQL queries to clean up, merge, and prepare the
datasets for analysis. Depending on the tool
you decide to use, the cleanup process can be
simple or very complex. Soon, you’ll learn how to make the best choice for
your situation. As a final note, programming
languages like R are also very useful
for cleaning data. You’ll learn more
about how to use R and other concepts
we covered soon.
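
As a rough sketch of reconciling the two associations’ data described in the transcript, here is a short pandas example; the mapping of “Young Professional” to “Student Associate” follows the transcript, but the emails and rows themselves are invented.

    import pandas as pd

    # Hypothetical rows from the two associations after stacking them together
    members = pd.DataFrame({
        "email":       ["kai@example.com", "KAI@example.com", "ana@example.com"],
        "member_type": ["Young Professional", "Student Associate", "Professional"],
    })

    # Map both associations' terms onto one standard label
    members["member_type"] = members["member_type"].replace(
        {"Young Professional": "Student Associate"})

    # Use a normalized email as the shared identifier, then drop duplicate people
    members["email"] = members["email"].str.lower().str.strip()
    members = members.drop_duplicates(subset="email")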

Reading: Common data-cleaning pitfalls


Practice Quiz: Hands-On Activity: Cleaning data with spreadsheets

Practice Quiz: Test your knowledge on data-cleaning techniques

Fill in the blank: Every database has its own formatting, which can cause the data to seem inconsistent. Data analysts use the _____ tool to create a clean and consistent visual appearance for their spreadsheets.

What is the process of combining two or more datasets into a single dataset?

Fill in the blank: In data analytics, _____ describes how well two or more datasets are able to work together.

Cleaning data in spreadsheets


Video: Data-cleaning features in spreadsheets

Key Tools:

  • Conditional formatting: Highlights cells based on specific conditions, aiding in identifying missing information.
  • Remove duplicates: Eliminates duplicate entries from the spreadsheet.
  • Format dates: Ensures consistent date format for accurate analysis.
  • Split text to columns: Separates combined data within a cell into multiple columns (e.g., splitting names or addresses).
  • Concat (concatenate): Joins multiple text strings into a single string.

Benefits:

  • Saves time and effort for data analysts.
  • Helps eliminate errors and inconsistencies.
  • Facilitates better data analysis and understanding.

Takeaways:

  • Spreadsheets are valuable tools for data cleaning.
  • Understanding and using various spreadsheet tools enhances efficiency and accuracy.
  • More data cleaning techniques with spreadsheets are coming up!

Here’s a tutorial on data-cleaning features in spreadsheets:

Introduction:

  • Explain the importance of data cleaning for accurate and reliable analysis.
  • Introduce spreadsheets as powerful tools for data cleaning.

Key Data-Cleaning Features:

  1. Conditional Formatting:
    • Highlight cells based on rules to identify:
      • Missing values (e.g., blank cells)
      • Inconsistencies (e.g., text in a numeric column)
      • Outliers (e.g., values outside a typical range)
    • Demonstrate how to apply conditional formatting using examples.
  2. Removing Duplicates:
    • Remove duplicate rows to ensure accurate counts and avoid redundant data.
    • Show the “Remove Duplicates” feature in action.
  3. Formatting Dates:
    • Ensure consistent date formats for accurate analysis and avoid misinterpretation.
    • Demonstrate how to format dates using the appropriate tools.
  4. Splitting Text to Columns:
    • Separate combined data within a cell into multiple columns:
      • Split names into first and last names
      • Divide addresses into city, state, and zip code
      • Extract individual certifications from a list
    • Demonstrate the “Split Text to Columns” feature, highlighting the use of delimiters.
  5. Concatenating Text:
    • Combine text from multiple cells into a single cell:
      • Create full names from first and last names
      • Combine addresses
      • Construct full sentences
    • Demonstrate the “CONCATENATE” function or its equivalent.
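
For comparison, here is a hedged pandas sketch of rough equivalents to splitting text to columns, concatenating, and formatting dates; the member records and column names are made up.

    import pandas as pd

    # Hypothetical member records
    df = pd.DataFrame({
        "full_name":      ["Ana Lopez", "Kai Chen"],
        "certifications": ["CPIM,CSCP", "CLTD"],
        "joined":         ["2019-04-01", "2021-11-15"],
    })

    # Split text to columns: break one cell into several using a delimiter
    df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)
    certs = df["certifications"].str.split(",", expand=True)

    # Concatenate: join strings back together
    df["label"] = df["last_name"] + ", " + df["first_name"]

    # Format dates consistently as a real date type
    df["joined"] = pd.to_datetime(df["joined"])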

Additional Tips:

  • Text Functions: Use text functions (e.g., UPPER, LOWER, TRIM, PROPER) to clean text data:
    • Convert text to uppercase or lowercase
    • Remove extra spaces
    • Capitalize first letters
  • Find and Replace: Fix typos and inconsistencies using the “Find and Replace” tool.
  • Data Validation: Set rules for data entry to prevent errors and maintain data integrity.

Best Practices:

  • Always work on a copy of your original data before making changes.
  • Test your cleaning techniques on a small sample before applying them to the full dataset.
  • Document your cleaning steps for reproducibility and transparency.

Conclusion:

  • Emphasize the importance of data cleaning for effective analysis.
  • Encourage practice with these features to improve data quality and accuracy.
  • Highlight the availability of additional data-cleaning tools and techniques for more complex tasks.

Hi again. As you learned earlier, there’s a lot of different
ways to clean up data. I’ve shown you some examples of how you can clean data manually, such as searching for and fixing misspellings or removing
empty spaces and duplicates. We also learned that lots of spreadsheet applications have tools that help simplify and speed up the data
cleaning process. There’s a lot of great
efficiency tools that data analysts
use all the time, such as conditional formatting, removing duplicates,
formatting dates, fixing text strings
and substrings, and splitting text to columns. We’ll explore those
in more detail now. The first is something called
conditional formatting. Conditional formatting
is a spreadsheet tool that changes how cells appear when values meet
specific conditions. Likewise, it can let
you know when a cell does not meet the
conditions you’ve set. Visual cues like this are very
useful for data analysts, especially when we’re working in a large spreadsheet
with lots of data. Making certain data
points standout makes the information easier
to understand and analyze. For cleaning data, knowing when the data doesn’t follow the
condition is very helpful. Let’s return to the logistics
association spreadsheet to check out conditional
formatting in action. We’ll use conditional formatting to highlight blank cells. That way, we know where there’s missing information so we can
add it to the spreadsheet. To do this, we’ll start by selecting the range
we want to search. For this example we’re
not focused on address 3 and address 5. The range will include all the
columns in our spreadsheet, except for F and H. Next, we’ll go to Format and choose
Conditional formatting. Great. Our range is automatically
indicated in the field. The format rule will be to format cells if
the cell is empty. Finally, we’ll choose
the formatting style. I’m going to pick a
shade of bright pink, so my blanks really stand out. Then click “Done,” and the blank cells are
instantly highlighted. The next spreadsheet
tool removes duplicates. As you’ve learned before, it’s always smart to make
a copy of the data set before removing
anything. Let’s do that now. Great, now we can continue. You might remember that
our example spreadsheet has one association
member listed twice. To fix that, go to Data and
select “Remove duplicates.” “Remove duplicates” is a tool
that automatically searches for and eliminates duplicate
entries from a spreadsheet. Choose “Data has header row”
because our spreadsheet has a row at the very top that describes the contents
of each column. Next, select “All” because we want to inspect our
entire spreadsheet. Finally, “Remove duplicates.” You’ll notice the
duplicate row was found and immediately removed. Another useful spreadsheet tool enables you to make
formats consistent. For example, some of the dates in this spreadsheet are in
a standard date format. This could be confusing
if you wanted to analyze when association
members joined, how often they renewed
their memberships, or how long they’ve been
with the association. To make all of our
dates consistent, first select column J, then go to “Format,” select “Number,” then “Date.” Now all of our dates have
a consistent format. Before we go over the next tool, I want to explain what
a text string is. In data analytics, a text string is a group of characters
within a cell, most often composed of letters. An important characteristic of a text string is its length, which is the number
of characters in it. You’ll learn more
about that soon. For now, it’s also
useful to know that a substring is a smaller
subset of a text string. Now let’s talk about Split. Split is a tool that divides
a text string around the specified character and puts each fragment into a
new and separate cell. Split is helpful when you
have more than one piece of data in a cell and you
want to separate them out. This might be a person’s
first and last name listed together, or it could be a cell that contains someone’s city, state, country, and zip code, but you actually want each
of those in its own column. Let’s say this association
wanted to analyze all of the different professional
certifications its members have earned. To do this, you want each certification separated
out into its own column. Right now, the certifications
are separated by a comma. That’s the specified text
separating each item, also called the delimiter. Let’s get them separated. Highlight the column,
then select “Data,” and “Split text to columns.” This spreadsheet
application automatically knew that the comma
was a delimiter and separated each certification. But sometimes you might need to specify what the
delimiter should be. You can do that here. Split text to columns
is also helpful for fixing instances of
numbers stored as text. Sometimes values in
your spreadsheet will seem like numbers, but they’re formatted as text. This can happen when
copying and pasting from one place to another or if
the formatting’s wrong. For this example, let’s check out our new spreadsheet
from a cosmetics maker. If a data analyst wanted
to determine total profits, they could add up
everything in column F. But there’s a problem; one of the cells has an error. If you check into it, you learn that the
“707” in this cell is text and can’t be
changed into a number. When the spreadsheet tries
to multiply the cost of the product by the
number of units sold, it’s unable to make
the calculation. But if we select the orders column and choose
“Split text to columns,” the error is resolved because now it can be
treated as a number. Coming up, you’ll learn about a tool that does
just the opposite. CONCATENATE is a
function that joins multiple text strings
into a single string. Spreadsheets are a very important
part of data analytics. They save data analysts
time and effort and help us eliminate
errors each and every day. Here, you’ve learned about some of the most common
tools that we use. But there’s a lot more to come. Next, we’ll learn even
more about data cleaning with spreadsheet
tools. Bye for now!
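
Here is a small pandas sketch of the “numbers stored as text” problem from the transcript; the order values are invented, and pd.to_numeric with errors="coerce" is just one way to convert them.

    import pandas as pd

    # Hypothetical orders column that was read in as text (e.g., pasted values)
    orders = pd.Series(["698", "707", "1,201", "n/a"])

    # Coerce everything to numeric; values that cannot be parsed become NaN
    as_numbers = pd.to_numeric(orders.str.replace(",", ""), errors="coerce")

    print(as_numbers.sum())   # the column can now be totaled (NaN is skipped)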

Video: Optimize the data-cleaning process

Key Functions:

  • COUNTIF: Counts cells matching a specific value (e.g., checking price validity).
  • LEN: Determines the length of a text string (e.g., verifying ID code length).
  • LEFT/RIGHT: Extracts a set number of characters from the left/right of a string (e.g., separating product code parts).
  • MID: Extracts a specific segment from the middle of a string (e.g., isolating state abbreviations).
  • CONCATENATE: Joins multiple text strings into a single one (e.g., recombining separated product codes).
  • TRIM: Removes leading, trailing, and repeated spaces from text (e.g., ensuring accurate search results).
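
For readers working outside spreadsheets, here is a hedged pandas sketch of rough equivalents to the functions above; the product codes are invented, and the character positions assume a fixed SKU-XX-NNNN pattern.

    import pandas as pd

    # Hypothetical product codes
    codes = pd.Series([" SKU-GA-1234 ", "SKU-TX-5678", "SKU-GA-9999"])

    trimmed = codes.str.strip()                   # TRIM: remove stray spaces
    lengths = trimmed.str.len()                   # LEN: length of each string
    left3   = trimmed.str[:3]                     # LEFT: first 3 characters
    right4  = trimmed.str[-4:]                    # RIGHT: last 4 characters
    state   = trimmed.str[4:6]                    # MID: characters 5-6 (state code)
    rebuilt = left3 + "-" + state + "-" + right4  # CONCATENATE: join pieces
    countif = (state == "GA").sum()               # COUNTIF: cells matching "GA"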

Benefits:

  • Automate data cleaning tasks for efficient analysis.
  • Ensure data accuracy and consistency.
  • Extract and combine specific information from text data.

Takeaways:

  • Functions offer powerful tools for cleaning and manipulating spreadsheet data.
  • Understanding and applying these functions can significantly improve data quality.
  • More advanced data cleaning techniques and practice opportunities are coming up.

Optimize Your Data Cleaning Process: A Practical Guide

Data cleaning often feels like an endless chore, but taking the right steps can drastically improve your efficiency and effectiveness. This tutorial outlines valuable strategies to optimize your data-cleaning process and make it less of a grind and more of a rewarding quest for data quality.

1. Understand Your Data and Goals:

  • Know your data source: Identify the origin, format, and potential biases of your data.
  • Define your cleaning goals: Clearly outline what you want to achieve through cleaning (e.g., removing duplicates, formatting dates, etc.).
  • Prioritize tasks: Focus on tasks most critical to your analysis and address them first.

2. Automate & Leverage Tools:

  • Utilize scripting languages: Automate repetitive tasks like finding and replacing errors or formatting data sets.
  • Explore data cleaning libraries: Python libraries like Pandas and NumPy offer powerful tools for data cleansing and manipulation.
  • Use built-in spreadsheet functions: Leverage functions like COUNTIF, TRIM, and CONCATENATE in Excel or similar software for specific cleaning tasks (see the pandas sketch after this list).
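
As one example of the automation described above, here is a minimal pandas sketch; the file names members_raw.csv and members_clean.csv and the columns name and dues are hypothetical:

    import pandas as pd

    # Hypothetical file and column names; adjust to your own data
    df = pd.read_csv("members_raw.csv")
    df["name"] = df["name"].str.strip()                       # drop stray spaces (like TRIM)
    df = df.drop_duplicates()                                 # remove duplicate rows
    df["dues"] = pd.to_numeric(df["dues"], errors="coerce")   # force text numbers to numeric
    df.to_csv("members_clean.csv", index=False)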

3. Validate and Test Your Cleaning:

  • Sample testing: Clean a small sample of data first to check if your methods are effective and don’t inadvertently introduce errors.
  • Cross-validation: Compare your cleaned data with another source (if available) to ensure consistency and accuracy.
  • Visualize the results: Use data visualization tools to assess the distribution and identify any remaining anomalies after cleaning (a short validation sketch follows this list).
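
A quick validation pass on a sample might look like the following sketch; it reuses the hypothetical file and columns from the sketch above:

    import pandas as pd

    # Run checks on a small sample before processing everything
    sample = pd.read_csv("members_clean.csv").head(100)
    assert sample["member_id"].notna().all(), "null member IDs remain"
    assert sample["dues"].between(100, 500).all(), "dues outside the expected range"
    print(sample.describe())   # quick look at distributions for anything odd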

4. Document and Improve:

  • Maintain a cleaning log: Document your applied methods, encountered issues, and solutions for future reference.
  • Share your insights: Collaborate with your team or community to share best practices and learn from each other’s experiences.
  • Continuously optimize: Regularly reflect on your cleaning process and identify areas for further automation or improvement.

Bonus Tips:

  • Start with a backup: Always work on a copy of your original data to avoid accidentally corrupting the source.
  • Break down large tasks: Divide complex cleaning tasks into smaller, manageable steps.
  • Take breaks and revisit: Stepping away can help you identify fresh perspectives and avoid overlooking hidden issues.

By implementing these strategies, you can transform data cleaning from a tedious chore into a streamlined and efficient process, allowing you to focus on the exciting task of analyzing and uncovering meaningful insights from your data. Remember, optimizing your data cleaning process is an ongoing journey, so keep learning, experimenting, and improving your skills to become a data cleaning master!

Welcome back. You’ve
learned about some very useful data-
cleaning tools that are built right into
spreadsheet applications. Now we’ll explore
how functions can optimize your efforts to
ensure data integrity. As a reminder, a function is
a set of instructions that performs a specific calculation using the data in a spreadsheet. The first function we’ll
discuss is called COUNTIF. COUNTIF is a function
that returns the number of cells that
match a specified value. Basically, it counts
the number of times a value appears
in a range of cells. Let’s go back to our professional
association spreadsheet. In this example, we want to make sure the association membership
prices are listed accurately. We’ll use COUNTIF to check
for some common problems, like negative numbers
or a value that’s much less or much
greater than expected. To start, let’s find the
least expensive membership: $100 for student associates. That’ll be the lowest number
that exists in this column. If any cell has a value that’s less than 100, COUNTIF
will alert us. We’ll add a few more rows at the bottom of our spreadsheet, then beneath column
H, type “Member dues less than $100.” Next, type the function in
the cell next to it. Every function has
a certain syntax that needs to be
followed for it to work. Syntax is a predetermined
structure that includes all required information and its proper placement. The syntax of a COUNTIF
function should be like this: Equals COUNTIF, open parenthesis, range, comma, the specified value in quotation marks and a
closed parenthesis. It will show up like this: =COUNTIF(I2:I72,"<100"), where I2 through I72 is the range and the specified value is less than 100. This tells the function
to go through column I, and return a count
of all cells that contain a number less than 100. Turns out there is one! Scrolling through our data, we find that one
piece of data was mistakenly keyed in
as a negative number. Let’s fix that now. Now we’ll use COUNTIF to search for any values that are more
than we would expect. The most expensive
membership type is $500 for corporate members. Type the function in the cell. This time it will
appear like this: =COUNTIF(I2:I72,">500"). I2 through I72 is still the range, but the value is greater than 500. There’s one here
too. Check it out. This entry has an extra zero. It should be $100.
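
If you ever need to run the same sanity check outside of a spreadsheet, here is a minimal pandas sketch; the dues values are made up and stand in for column I:

    import pandas as pd

    # Made-up dues values standing in for column I
    dues = pd.Series([100, 275, -500, 1000, 350])
    print((dues < 100).sum())   # counterpart of =COUNTIF(I2:I72, "<100")
    print((dues > 500).sum())   # counterpart of =COUNTIF(I2:I72, ">500")
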
The next function we’ll discuss is called LEN. LEN is a function that
tells you the length of the text string by counting the number of
characters it contains. This is useful when
cleaning data if you have a certain
piece of information in your spreadsheet that you know must be a certain length. For example, this association uses six-digit member
identification codes. If we’d just imported this data and wanted to be
sure our codes are all the correct number
of digits, we’d use LEN. The syntax of LEN is equals LEN, open parenthesis, the range, and the close parenthesis. We’ll insert a new
column after Member ID. Then type an equals sign and LEN. Add an open parenthesis. The range is the first
Member ID number in A2. Finish the function by
closing the parenthesis. It tells us that there are
six characters in cell A2. Let’s continue the
function through the entire column and find out if any
results are not six. But instead of manually going through our
spreadsheet to search for these instances, we’ll use
conditional formatting. We talked about conditional
formatting earlier. It’s a spreadsheet tool that changes how cells appear when values meet specific conditions.
Let’s practice that now. Select all of column B
except for the header. Then go to Format and choose
Conditional formatting. The format rule is
to format cells if not equal to six. Click “Done.” The cell with the seven inside
is highlighted.
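
The same length check can be scripted as well; a minimal pandas sketch with made-up member IDs:

    import pandas as pd

    # Made-up member IDs; one is seven characters long
    member_ids = pd.Series(["100218", "100374", "1004125", "100521"])
    lengths = member_ids.str.len()        # counterpart of =LEN(A2)
    print(member_ids[lengths != 6])       # flags anything that is not six characters
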
Now we’re going to talk about LEFT and RIGHT. LEFT is a function that gives you a set number of characters from the left side of a text string. RIGHT is a function
that gives you a set number of characters from the right side of a text string. As a quick reminder,
a text string is a group of characters
within a cell, commonly composed of
letters, numbers, or both. To see these functions in action, let’s go back to the
spreadsheet from the cosmetics maker from earlier. This spreadsheet
contains product codes. Each has a five-digit
numeric code and then a four-character
text identifier. But let’s say we only
want to work with one side or the other. You can use LEFT or RIGHT to give you the specific set of characters or numbers you need. We’ll practice
cleaning up our data using the LEFT function first. The syntax of LEFT is equals
LEFT, open parenthesis, the range, a comma, and a number of characters from the left side of the
text string we want. Then, we finish it with
a closed parenthesis. Here, our project requires just the five-digit
numeric codes. In a separate column, type equals LEFT, open
parenthesis, then the range. Our range is A2. Then, add a comma, and then number 5 for our five-
digit product code. Finally, finish the function
with a closed parenthesis. Our function should
show up like this: =LEFT(A2,5). Press “Enter.” And now, we have a substring, which is the number part
of the product code only. Click and drag this function
through the entire column to separate out the rest of the product codes by number only. Now, let’s say our
project only needs the four-character text identifier. For that, we’ll use
the RIGHT function, and the next column will
begin the function. The syntax is equals RIGHT, open parenthesis, the range, a comma and the number
of characters we want. Then, we finish with
a closed parenthesis. Let’s key that in now. Equals right, open parenthesis,
and the range is still A2. Add a comma. This time, we’ll
tell it that we want the first four characters
from the right. Close up the parenthesis
and press “Enter.” Then, drag the function
throughout the entire column. Now, we can analyze
the product in our spreadsheet based
on either substring: the five-digit numeric code or the four-character
text identifier. Hopefully, that makes it clear
how you can use LEFT and RIGHT to extract substrings from the left and right
sides of a string. Now, let’s learn
how you can extract something in between. Here’s where we’ll use
something called MID. MID is a function that gives you a segment from the
middle of a text string. This cosmetics company lists all of its clients
using a client code. It’s composed of the
first three letters of the city where the
client is located, its state abbreviation, and then a three-
digit identifier. But let’s say a data
analyst needs to work with just the
states in the middle. The syntax for MID is equals MID, open parenthesis, the
range, then a comma. When using MID, you always need to supply
a reference point. In other words, you need to set where the function should start. After that, place another comma, and how many middle
characters you want. In this case, our range is D2. Let’s start the function
in a new column. Type equals MID, open
parenthesis, D2. Then the first three characters
represent a city name, so that means the starting
point is the fourth. Add a comma and four. We also need to tell the function how many middle
characters we want. Add one more comma, and two, because the
state abbreviations are two characters long. Press “Enter” and bam, we just get the
state abbreviation. Continue the MID function
through the rest of the column. We’ve learned about a
few functions that help separate out specific
text strings. But what if we want to
combine them instead? For that, we’ll use CONCATENATE, which is a function that joins together two or
more text strings. The syntax is equals CONCATENATE, then an
open parenthesis. Inside, indicate each text string you want to join,
separated by commas. Then finish the function
with a closed parenthesis. Just for practice, let’s
say we needed to rejoin the left and right
text strings back into complete product codes. In a new column, let’s
begin our function. Type equals CONCATENATE,
then an open parenthesis. The first text string
we want to join is in H2. Then add a comma. The second part is in I2. Add a closed parenthesis
and press “Enter”. Drag it down through
the entire column, and just like that, all of our product codes
are back together. The last function we’ll
learn about here is TRIM. TRIM is a function
that removes leading, trailing, and repeated
spaces in data. Sometimes when you import data, your cells have extra spaces, which can get in the
way of your analysis. For example, if this
cosmetics maker wanted to look up a
specific client name, it won’t show up in the search
if it has extra spaces. You can use TRIM to
fix that problem. The syntax for TRIM
is equals TRIM, open parenthesis, your range,
and closed parenthesis. In a separate column, type equals TRIM and
an open parenthesis. The range is C2, as you want to check
out the client names. Close the parenthesis
and press “Enter”. Finally, continue the
function down the column. TRIM fixed the extra spaces.
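
For reference, the string functions above have close counterparts in pandas. Here is a minimal sketch, using hypothetical product and client codes in the same formats as the spreadsheet examples:

    import pandas as pd

    products = pd.Series(["15143EXFO", "32729MASC"])      # five digits + four letters
    clients = pd.Series(["  BALMD102 ", "AUSTX035"])      # city + state + identifier, with stray spaces

    left5 = products.str[:5]                # like =LEFT(A2, 5)
    right4 = products.str[-4:]              # like =RIGHT(A2, 4)
    rejoined = left5 + right4               # like =CONCATENATE(H2, I2)
    trimmed = clients.str.strip()           # like =TRIM(C2)
    state = trimmed.str[3:5]                # like =MID(D2, 4, 2)
    print(rejoined.tolist(), trimmed.tolist(), state.tolist())
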
Now we know some very useful functions that can make your data cleaning
even more successful. This was a lot of information. As always, feel
free to go back and review the video and then
practice on your own. We’ll continue building
on these tools soon, and you’ll also have
a chance to practice. Pretty soon, these data
cleaning steps will become second nature,
like brushing your teeth.

Reading: Workflow automation


Video: Different data perspectives

Key Techniques:

  • Sorting and filtering: Organizes data for easier analysis, finds similar entries, and identifies duplicates.
  • Pivot tables: Provides a summarized view of specific data subsets, focusing on relevant information.
  • VLOOKUP: Searches for specified values in another data set and retrieves corresponding information.
  • Plotting: Visualizes data distribution, helps identify outliers and potential errors.

Benefits:

  • Improves data cleaning efficiency by focusing on specific needs.
  • Uncovers hidden insights and anomalies through different data perspectives.
  • Ensures data accuracy and completeness for reliable analysis.

Example:

  • Using pivot tables in a cosmetics data set to identify the most profitable products (at least $10,000 in orders).
  • Employing VLOOKUP to match product codes with actual product names for further analysis.
  • Plotting product prices to detect potential errors like a misplaced decimal point.

Takeaways:

  • Adapting your data view enhances cleaning effectiveness and reveals valuable insights.
  • Combining these techniques with practice leads to improved data quality and analysis.
  • More advanced data cleaning strategies and practice opportunities are coming up.

Seeing Data Differently: Exploring Data Perspectives for Deeper Insights

Data often tells a different story depending on how you look at it. Just like viewing a landscape from various angles, changing your data perspective can reveal hidden patterns, trends, and insights that might be invisible otherwise. This tutorial dives into how different perspectives can transform your data analysis and unlock its full potential.

1. Zooming In and Out:

  • Granularity matters: Analyze data at multiple levels, from individual records to aggregated summaries. Imagine zooming in on a map: you can see specific streets and houses, but lose the bigger picture. Conversely, zooming out reveals the city layout and connections between neighborhoods. Similarly, analyzing individual customer purchases offers detailed insights, while grouping by product categories or regions provides broader trends.

2. Slicing and Dicing:

  • Segmenting data: Divide your data into smaller subsets based on specific criteria like demographics, product categories, or time periods. This allows you to compare and contrast different groups, revealing unique characteristics and behaviors within each segment. For instance, segmenting online shoppers by age groups might show distinct purchasing patterns for each generation.

3. Shifting the Focus:

  • Changing the dependent variable: Analyze the same data set with different variables as the focus. This highlights different relationships and dependencies within the data. Consider analyzing website traffic: you could track visits per page, conversions per campaign, or bounce rates per time of day. Each perspective reveals different aspects of user behavior.

4. Exploring Relationships:

  • Correlation and causation: Identify relationships between variables and assess their potential causal links. Visualizing data through scatter plots or correlation matrices can unveil hidden connections and potential areas for further investigation. Remember, correlation doesn’t always imply causation, so deeper analysis is crucial.

5. Time Travel:

  • Temporal analysis: Analyze data over time to understand trends, seasonality, and evolution. Techniques like time series analysis and trend forecasting help predict future patterns and make informed decisions. Tracking website traffic over months might reveal seasonal peaks or the impact of marketing campaigns (a minimal sketch follows).
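
A minimal pandas sketch of this kind of temporal analysis, using made-up daily visit counts:

    import pandas as pd

    # Made-up daily website visits
    dates = pd.date_range("2024-01-01", periods=90, freq="D")
    visits = pd.Series(range(90), index=dates)
    print(visits.resample("MS").sum())             # totals by month reveal the bigger trend
    print(visits.rolling(window=7).mean().tail())  # a 7-day rolling average smooths daily noise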

Benefits of Diverse Data Perspectives:

  • Uncover hidden insights: Different perspectives reveal patterns and trends that might be invisible from a single viewpoint.
  • Deeper understanding: Gain a more comprehensive picture of your data and the relationships within it.
  • Informed decision-making: Base your decisions on a richer understanding of the data and its nuances.
  • Improved problem-solving: Identify root causes of issues and develop targeted solutions by viewing data from different angles.

Remember:

  • Experimentation is key: Try different data perspectives and visualizations to see what resonates and unlocks the most valuable insights.
  • Context matters: Always consider the context and limitations of your data when interpreting insights from different perspectives.
  • Combine perspectives: Don’t rely on a single viewpoint; triangulate your findings by combining insights from various perspectives for a more robust understanding.

By embracing different data perspectives, you can transform your data from a flat collection of numbers into a dynamic story waiting to be told. Remember, it’s not just about the data itself, but how you choose to look at it that unlocks its true potential and empowers you to make informed decisions and extract valuable knowledge.

I hope this tutorial helps you explore the exciting world of data perspectives and encourages you to see your data in a whole new light!

Hi, let’s get into it. Motivational speaker
Wayne Dyer once said, “If you change the way you look at things,
the things you look at change.” This is so true in data analytics. No two analytics projects
are ever exactly the same. So it only makes sense that
different projects require us to focus on different
information differently. In this video, we’ll explore different methods that data
analysts use to look at data differently and how that leads to more efficient and
effective data cleaning. Some of these methods include sorting and
filtering, pivot tables, a function called VLOOKUP, and
plotting to find outliers. Let’s start with sorting and filtering. As you learned earlier, sorting and filtering data helps data
analysts customize and organize the information the way
they need for a particular project. But these tools are also very useful for
data cleaning. You might remember that sorting involves
arranging data into a meaningful order to make it easier to understand,
analyze, and visualize. For data cleaning, you can use sorting to put things
in alphabetical or numerical order, so you can easily find a piece of data. Sorting can also bring duplicate entries
closer together for faster identification. Filters, on the other hand, are very
useful in data cleaning when you want to find a particular piece of information. You learned earlier that filtering
means showing only the data that meets a specific criteria while hiding the rest. This lets you view only
the information you need. When cleaning data,
you might use a filter to only find values above a certain number, or
just even or odd values. Again, this helps you find
what you need quickly and separates out the information
you want from the rest. That way you can be more
efficient when cleaning your data. Another way to change the way you
view data is by using pivot tables. You’ve learned that a pivot table is
a data summarization tool that is used in data processing. Pivot tables sort, reorganize, group, count,
total, or average data stored in a database. In data cleaning, pivot tables are used to give you a quick, clutter-free view of your data. You can
choose to look at the specific parts of the data set that you need to
get a visual in the form of a pivot table. Let’s create one now using our
cosmetics maker’s spreadsheet again. To start, select the data we want to use. Here, we’ll choose the entire spreadsheet.
Select “Data” and then “Pivot table.” Choose “New sheet” and “Create.” Let’s say we’re working on
a project that requires us to look at only the most
profitable products. Items that earn the cosmetics maker
at least $10,000 in orders. So the row we’ll include is “Total” for
total profits. We’ll sort in descending order to put
the most profitable items at the top. And we’ll show totals. Next, we’ll add another row for products so that we know what
those numbers are about. We can clearly determine tha
the most profitable products have the product codes
15143 E-X-F-O and 32729 M-A-S-C. We can ignore the rest for this particular project because
they fall below $10,000 in orders. Now, we might be able to use context
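
The same summary can be scripted; here is a minimal pandas sketch with made-up order totals and one extra hypothetical product code:

    import pandas as pd

    orders = pd.DataFrame({
        "product_code": ["15143EXFO", "32729MASC", "20215LIPS", "15143EXFO"],
        "total": [8000.0, 10500.0, 4200.0, 6000.0],
    })
    # Counterpart of the pivot table: total per product, most profitable first
    summary = orders.pivot_table(index="product_code", values="total", aggfunc="sum")
    summary = summary.sort_values("total", ascending=False)
    print(summary[summary["total"] >= 10000])   # keep products with at least $10,000 in orders
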
Now, we might be able to use context clues to assume we’re talking about exfoliants and mascaras. But we don’t know which ones, or
if that assumption is even correct. So we need to confirm what
the product codes correspond to. And this brings us to the next tool. It’s called VLOOKUP. VLOOKUP stands for vertical lookup. It’s a function that searches for a certain value in a column to return
a corresponding piece of information. When data analysts look up information for
a project, it’s rare for all of the data they need
to be in the same place. Usually, you’ll have to search
across multiple sheets or even different databases. The syntax of the VLOOKUP is
equals VLOOKUP, open parenthesis, then the data you want to look up. Next is a comma and
where you want to look for that data. In our example, this will be the name of a
spreadsheet followed by an exclamation point. The exclamation point indicates that
we’re referencing a cell in a different sheet from the one we’re
currently working in. Again, that’s very common
in data analytics. Okay, next is the range in
the place where you’re looking for data, indicated using the first and
last cell separated by a colon. After one more comma is the column in
the range containing the value to return. Next, another comma and the word “false,” which means that an exact match
is what we’re looking for. Finally, complete your function
by closing the parentheses. To put it simply, VLOOKUP searches for
the value in the first argument in the leftmost column of
the specified location. Then the value of the third
argument tells VLOOKUP to return the value in the same
row from the specified column. The “false” tells VLOOKUP
that we want an exact match. Soon you’ll learn the difference
between exact and approximate matches. But for now, just know that VLOOKUP
takes the value in one cell and searches for a match in another place. Let’s begin. We’ll type equals VLOOKUP. Then add the data we are looking for,
which is the product data. The dollar sign makes sure that
the corresponding part of the reference remains unchanged. You can lock just the column, just the row, or both at the same time. Next, we’ll tell it to look
at Sheet 2, across both columns. We added 2 to represent
the second column. The last term, “false,” says
we wanted an exact match. With this information,
we can now analyze the data for only the most profitable products. Going back to the two most profitable
products, we can search for 15143 E-X-F-O and
32729 M-A-S-C. Go to Edit and then Find. Type in the product codes and search for them. Now we can learn which products we’ll
be using for this particular project.
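
In a scripted workflow, the closest counterpart to VLOOKUP is a join between two tables. Here is a minimal pandas sketch, with hypothetical product names:

    import pandas as pd

    orders = pd.DataFrame({"product_code": ["15143EXFO", "32729MASC"]})
    lookup = pd.DataFrame({
        "product_code": ["15143EXFO", "32729MASC"],
        "product_name": ["exfoliant A", "mascara B"],   # hypothetical names
    })
    # Like VLOOKUP with an exact match: pull the matching name into the orders table
    print(orders.merge(lookup, on="product_code", how="left"))
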
The final tool we’ll talk about is something called plotting. When you plot data,
you put it in a graph, chart, table, or other visual to help you quickly see what it looks like. Plotting is very useful when trying to
identify any skewed data or outliers. For example, if we want to make sure
the price of each product is correct, we could create a chart. This would
give us a visual aid that helps us quickly figure out if
anything looks like an error. So let’s select the column
with our prices. Then we’ll go to Insert and choose Chart. Pick a column chart as the type. One
of these prices looks extremely low. If we look into it, we discover that this item has
a decimal point in the wrong place. It should be $7.30, not 73 cents. That would have a big impact
on our total profits. So it’s a good thing we caught
that during data cleaning.
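
A quick plot like this can also be generated in code; a minimal sketch using pandas and matplotlib, with made-up prices that include the misplaced decimal:

    import pandas as pd
    import matplotlib.pyplot as plt

    prices = pd.Series([7.10, 6.95, 0.73, 7.30, 6.80])   # one misplaced decimal point
    prices.plot(kind="bar", title="Unit price by product")
    plt.show()   # the unusually short bar points straight to the 73-cent error
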
Looking at data in new and creative ways helps data analysts identify all kinds of dirty data. Coming up, you’ll continue
practicing these new concepts so you can get more comfortable with them. You’ll also learn additional strategies
for ensuring your data is clean, and we’ll provide you
with effective insights. Great work so far.

Video: Even more data-cleaning techniques

Data Mapping: Mastering Clean Data Migration

Focus: Data mapping as a crucial step for clean and compatible data migration.

Key Points:

  • Matching fields across databases: Data mapping ensures compatible data transfer and integration when different systems store data differently (e.g., Maryland vs. MD for state).
  • Importance of data mapping: Prevents errors and inconsistencies in merged data, leading to reliable analysis and avoiding ripple effects of mistakes.
  • Steps of data mapping:
    • Identify data to move and define desired format.
    • Map data field by field (simple or complex depending on schema and keys).
    • Use data mapping software for automation and validation.
    • Manually map data, ensuring content goes to the right place.
    • Transform data into a consistent format (e.g., concatenating address parts).
    • Transfer data to the destination (query, import wizards, drag & drop).
  • Testing and validation: Inspect sample data, check for nulls, use data cleaning tools (validation, formatting, COUNTIF, sorting, filtering).
  • Benefits of data mapping: Clear roadmap for safe data migration, avoids errors, enables reliable analysis.

Takeaway: Data mapping is an essential skill for data analysts to ensure clean and compatible data transfer, preventing errors and enabling accurate analysis.

Remember: This summary provides a concise overview of data mapping. Practice and experience are crucial to mastering this skill and ensuring successful data migration projects.

Level Up Your Data Cleaning Game: Advanced Techniques for Sparkling Data

Data cleaning often feels like tackling a mountain of laundry, but mastering advanced techniques can transform it into a satisfying chore, revealing pristine data ready for meaningful analysis. This tutorial explores some powerful tools to take your data cleaning to the next level:

1. Fuzzy Matching:

  • Embrace the inevitable typos: No dataset is immune to typos and misspellings. Fuzzy matching algorithms identify and match similar strings, correcting minor errors like “customer” vs. “custormer.”
  • Applications: Standardize names, product titles, or addresses, minimizing data inconsistencies and improving analysis accuracy (see the sketch below).
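
A minimal sketch using Python's built-in difflib module, with made-up labels:

    import difflib

    canonical = ["customer", "supplier", "partner"]
    messy_value = "custormer"
    # Returns the closest canonical label if the similarity is high enough
    print(difflib.get_close_matches(messy_value, canonical, n=1, cutoff=0.8))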

2. Anomaly Detection:

  • Unmask the outliers: Outliers are data points that significantly deviate from the norm. Techniques like Z-scores and box plots help detect these anomalies, prompting further investigation.
  • Benefits: Identify fraudulent transactions, malfunctioning sensors, or unexpected trends, leading to better data integrity and informed decisions (see the sketch below).
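
A minimal Z-score sketch in pandas, using made-up transaction amounts:

    import pandas as pd

    # Made-up transaction amounts with one obvious outlier
    amounts = pd.Series([50, 48, 52, 49, 51] * 6 + [500])
    z = (amounts - amounts.mean()) / amounts.std()
    print(amounts[z.abs() > 3])   # flags values more than 3 standard deviations from the mean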

3. Regular Expressions:

  • Tame the text beast: Regular expressions are powerful patterns used to search and manipulate text data. Extract specific information, like phone numbers or email addresses, from messy text fields.
  • Applications: Clean and standardize social media comments, scrape valuable insights from web pages, or automate repetitive text transformations (see the sketch below).
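
A minimal sketch using Python's re module to pull email addresses out of free text (the addresses are made up):

    import re

    text = "Contact sales@example.com or support@example.org for help."
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    print(emails)   # ['sales@example.com', 'support@example.org']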

4. Text Preprocessing:

  • Prepare text for analysis: Natural Language Processing (NLP) requires clean text data. Techniques like tokenization (splitting text into words) and stemming/lemmatization (reducing words to their root form) prepare text for efficient analysis.
  • Benefits: Improve the accuracy of sentiment analysis, topic modeling, and other NLP tasks by removing irrelevant information and ensuring consistent word representation.

5. Data Imputation:

  • Deal with missing values: Missing data can cripple analysis. Imputation techniques like mean or median filling, k-nearest neighbors, or machine learning models estimate missing values based on available data.
  • Caution: Choose imputation methods carefully, considering data distribution and potential biases introduced by estimated values (a minimal median-fill sketch follows).
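
A minimal median-fill sketch in pandas, with made-up ages:

    import pandas as pd

    ages = pd.Series([34, 41, None, 29, None, 38])
    print(ages.fillna(ages.median()))   # missing values replaced with the median (36.0)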

Bonus Tip:

  • Automate wherever possible: Use scripting languages like Python or R to automate repetitive cleaning tasks, saving you time and ensuring consistency.

Remember:

  • Context matters: Choose the right technique based on your data type, analysis goals, and acceptable level of risk.
  • Validate and test: Always test and validate your cleaning workflow to ensure it improves, not degrades, data quality.
  • Continuous learning: Data cleaning is an evolving field. Stay updated on new techniques and best practices to maintain your data-cleaning mastery.

By incorporating these advanced techniques into your repertoire, you can conquer even the most challenging data cleaning tasks, transforming messy data into sparkling insights ready to fuel your analysis and inform impactful decisions. So, grab your tools, embrace the challenge, and unleash the power of clean data!

I hope this tutorial inspires you to explore the exciting world of advanced data cleaning and empowers you to tackle any data challenge with confidence!

Hello. So far you’ve
learned about a lot of different tools
and functions that analysts use to clean
up data for analysis. Now we’ll take a step back
and talk about some of the really big picture
aspects of clean data. Knowing how to fix
specific problems, either manually
with spreadsheet tools, or with functions, is
extremely valuable. But it’s also important
to think about how your data has moved between systems and how it’s evolved along its journey to your
data analysis project. To do this, data analysts use something called
data mapping. Data mapping is the process of matching fields from one
database to another. This is very important to the
success of data migration, data integration, and lots of other data
management activities. As you learned earlier, different systems store
data in different ways. For example, the state field in one spreadsheet might show
Maryland spelled out. But another spreadsheet
might store it as MD. Data mapping helps us
note these kinds of differences so we know when data is moved and combined it will be compatible. As a quick reminder, compatibility describes how well two or more data sets are
able to work together. The first step to data mapping is identifying what data
needs to be moved. This includes the tables
and the fields within them. We also need to define
the desired format for the data once it reaches
its destination. To figure out how this
works, let’s go back to the merger between our two
logistics associations. Starting with the
first data field, we’ve identified that we need to move both sets of member IDs. To define the desired format, we’ll choose whether to use numbers like this spreadsheet, or email addresses like
the other spreadsheet. Next comes mapping the data. Depending on the
schema and number of primary and foreign
keys in a data source, data mapping can be
simple or very complex. As a reminder, a schema is a way of describing how
something is organized. A primary key references a
column in which each value is unique and a foreign key is a field within a table that is a primary key in another table. For more challenging
projects there’s all kinds of data mapping
software programs you can use. These data mapping
tools will analyze field by field how to move data from one place to another, then
they automatically clean, match, inspect, and
validate the data. They also create consistent
naming conventions, ensuring compatibility
when the data is transferred from
one source to another. When selecting a software
program to map your data, you want to be sure that it supports the file types
you’re working with, such as Excel, SQL,
Tableau, and others. Later on, you’ll learn more about selecting the right tool
for a particular task. For now, let’s practice
mapping data manually. First, we need to
determine the content of each section to make sure the data ends up in
the right place. For example, the data
on when memberships expire would be consolidated
into a single column. This step makes sure that
each piece of information ends up in the most
appropriate place in the merged data source. Now, you might remember
that some of the data was inconsistent between
the two organizations, like the fact that one uses a separate column for
suite apartment or unit number but
the other doesn’t. This brings us to the next step, transforming the data
into a consistent format. This is a great time
to use concatenate. As you learned before, concatenate is a
function that joins together two or
more text strings, which is what we did earlier with our cosmetics company example. We’ll insert a new column and then type equals concatenate, then the two text strings
we want to make one. Drag that through
the entire column. Now we have consistency in the new merged association’s list of member addresses.
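
A scripted migration follows the same idea. Here is a minimal pandas sketch, with hypothetical field names, that maps one schema onto another, standardizes state values, and concatenates the address parts:

    import pandas as pd

    # Hypothetical source table whose fields must be mapped to the destination schema
    source = pd.DataFrame({
        "MemberID": [101, 102],
        "StreetAddress": ["12 Main St", "88 Oak Ave"],
        "Unit": ["Suite 4", ""],
        "State": ["Maryland", "Ohio"],
    })
    field_map = {"MemberID": "member_id", "StreetAddress": "address",
                 "Unit": "unit", "State": "state"}
    dest = source.rename(columns=field_map)
    dest["state"] = dest["state"].map({"Maryland": "MD", "Ohio": "OH"})   # consistent format
    dest["address"] = (dest["address"] + " " + dest["unit"]).str.strip()  # concatenate address parts
    print(dest[["member_id", "address", "state"]])
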
Now that everything’s compatible, it’s time to transfer the data to its destination. There’s a lot of
different ways to move data from one
place to another, including querying,
import wizards, and even simple drag and drop. Here’s our merged spreadsheet. It looks good, but we
still want to make sure everything was
transferred properly. We’ll go into the testing
phase of data mapping. For this, you inspect
a sample piece of data to confirm that it’s clean
and properly formatted. It’s also a smart practice to do spot checks on things such
as the number of nulls. For the test, you can use a lot of the data cleaning tools
we discussed previously, such as data validation, conditional formatting, COUNTIF,
sorting, and filtering. Finally, once you’ve determined that the data is
clean and compatible, you can start using
it for analysis. Data mapping is so important
because even one mistake when merging data can ripple
throughout an organization, causing the same error to
appear again and again. This leads to poor results. On the other hand, data mapping can save the day by giving you a clear road map you
can follow to make sure your data arrives safely
at its destination. That’s why you
learn how to do it.

Practice Quiz: Hands-On Activity: Clean data with spreadsheet functions

Reading: Learning Log: Develop your approach to cleaning data

Practice Quiz: Test your knowledge on cleaning data in spreadsheets

Describe the relationship between a text string and a substring.

A data analyst uses the COUNTIF function to count the number of times a value less than 5 occurs between spreadsheet cells A2 through A100. What is the correct syntax?

Fill in the blank: To remove leading, trailing, and repeated spaces in data, analysts use the ____ function.

Module 2 challenge


Reading: Glossary: Terms and definitions

Quiz: Module 2 challenge

Fill in the blank: Conditional formatting is a spreadsheet tool that changes how _____ appear when values meet a specific condition.

A data analyst at a nonprofit organization is working with the following spreadsheet, which contains member name data in column C. They want to divide this data using the underscore as a delimiter, so that first names are stored in one column and last names in another. Which tool should the analyst use?

For a function to work properly, data analysts must follow each function’s predetermined structure. What is this structure called?

You are working with the following selection of a spreadsheet:

A data analyst in a human resources department is working with the following selection of a spreadsheet:

An analyst is cleaning a new dataset containing 500 rows. They want to make sure the data contained from cell B2 through cell B300 does not contain a number greater than 50. Which of the following COUNTIF function syntaxes could be used to answer this question? Select all that apply.

The V in VLOOKUP stands for what?

To evaluate how well two or more data sources work together, data analysts use data mapping.