
Module 4: Verify and report on your cleaning results

Cleaning your data is an essential step in the data analysis process. Verifying and reporting on your cleaning is how you show that your data is ready for the next step. In this part of the course, you’ll learn about the processes involved in verifying and reporting on data cleaning, as well as their benefits.

Learning Objectives

  • Describe the process involved in verifying the results of cleaning data
  • Describe what is involved in manually cleaning data
  • Discuss the elements and importance of data-cleaning reports
  • Describe the benefits of documenting the data-cleaning process

Manually cleaning data


Video: Verifying and reporting results

Verification and reporting are crucial steps after data cleaning.

Verification:

  • Ensures data cleaning was thorough and accurate.
  • Involves rechecking the dataset, manual cleanup, and reflecting on the project’s goal.
  • Catches errors like typos or incorrect references before analysis.
  • Prevents unreliable insights, misrepresented populations, and wasted effort.
  • Example: A forgotten semicolon could have significantly altered results and business decisions.

Reporting:

  • Improves transparency and accountability.
  • Builds trust with teammates and stakeholders.
  • Aligns everyone on project details.
  • Strategies: data-cleaning reports, process documentation, change logs.
  • Change logs track dataset evolution and facilitate communication.
  • Both verification and reporting help avoid mistakes and save time.

Overall message: Don’t skip verification and reporting. They are essential for reliable data analysis and successful projects.

Tutorial: Verification and Reporting – Securing the Cleanliness of Your Data

Introduction:

Cleaning your data is vital for accurate analysis and insightful conclusions. But the journey doesn’t end there! Verification and reporting are crucial next steps to ensure your data’s integrity and communicate its journey effectively. This tutorial will guide you through these essential processes, equipping you to confidently present clean and reliable data.

Verification:

  1. Revisit the Purpose:

Before diving in, remind yourself of the project’s original goals. Are you sure the cleaned data aligns with those objectives? This can guide your verification efforts.

  2. Double-check the Cleanliness:

Run through your cleaning steps again, meticulously revisiting every transformation and correction. Tools like statistical checks and visualizations can assist you.

  3. Manual Scrutiny:

Don’t rely solely on automation. Sample subsets of data and manually inspect them for lingering errors like typos, inconsistencies, or formatting issues.

  4. Contextual Validation:

Think beyond individual data points. Does the overall distribution and relationships between variables make sense? Do outliers require further investigation?

  5. Compare to Originals:

Contrast your cleaned data with the raw dataset. Can you explain the changes you made and justify their impact on the data’s integrity?
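
To make the comparison step concrete, here is a minimal SQL sketch, assuming a raw import table and a cleaned copy; the table names raw_sales and clean_sales are hypothetical:

  -- Compare row counts between the raw import and the cleaned copy
  -- so you can state exactly how many records cleaning removed.
  SELECT
    (SELECT COUNT(*) FROM raw_sales)   AS raw_row_count,
    (SELECT COUNT(*) FROM clean_sales) AS clean_row_count;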

Reporting:

  1. Data Cleaning Report:

Document your cleaning process comprehensively. Outline the tools used, transformations applied, and decisions made. This transparency builds trust and ensures reproducibility.

  2. Changelog:

Maintain a chronological record of every data modification, including additions, corrections, and removals. This helps track the data’s evolution and facilitates collaboration.

  3. Visualizations:

Use compelling graphics like histograms and scatter plots to showcase the cleaned data’s distribution and relationships between variables. This adds clarity and persuasiveness to your report.

  4. Stakeholder Communication:

Tailor your reports and presentations to your audience. Ensure technical details are comprehensible for data experts while providing clear summaries for non-technical stakeholders.

  5. Lessons Learned:

Reflect on your experience. Identify areas for improvement in your cleaning methods and document valuable insights gained from the verification process.

Benefits of Verification and Reporting:

  • Trustworthy Data: Double-checking ensures your data is reliable for analysis and decision-making.
  • Reproducible Workflow: Detailed reports allow others to replicate your cleaning process and validate your results.
  • Efficient Collaboration: Clear communication fosters teamwork and prevents misunderstandings within your team.
  • Continuous Improvement: Learning from past experiences helps refine your data cleaning skills and avoid future mistakes.

Conclusion:

Verification and reporting are not just formalities; they are pillars of responsible data analysis. By implementing these crucial steps, you can confidently stand behind your clean data, fostering trust, collaboration, and reliable insights. Remember, clean data is not just about the present; it’s about laying a solid foundation for future success.

Bonus Tip: Consider using data management platforms that offer built-in features for data versioning, logging changes, and generating reports. This can streamline your verification and reporting processes and ensure your data’s integrity throughout its lifecycle.

I hope this tutorial provides a valuable roadmap for ensuring your data’s cleanliness and effectively communicating its journey. Remember, clean data is not just a goal; it’s a continuous process with verification and reporting playing key roles in its success.

Hi there, great to have you back. You’ve been learning a lot about the importance of clean data and explored some tools
and strategies to help you throughout
the cleaning process. In these videos, we’ll be covering the next
step in the process: verifying and reporting on the integrity of your clean data. Verification is a
process to confirm that a data cleaning effort was well-executed and the resulting
data is accurate and reliable. It involves rechecking
your clean dataset, doing some manual
clean ups if needed, and taking a moment to
sit back and really think about the original
purpose of the project. That way, you can be
confident that the data you collected is credible and
appropriate for your purposes. Making sure your data is properly verified is so
important because it allows you to double-check
that the work you did to clean up your data was
thorough and accurate. For example, you
might have referenced an incorrect cellphone number or accidentally keyed in a typo. Verification lets you catch mistakes before you
begin analysis. Without it, any
insights you gain from analysis can’t be trusted
for decision-making. You might even risk
misrepresenting populations or damaging the outcome
of a product that you’re actually
trying to improve. I remember working
on a project where I thought the data I
had was sparkling clean because I’d used all the
right tools and processes, but when I went through the steps to verify the data’s integrity, I discovered a semicolon that
I had forgotten to remove. Sounds like a really
tiny error, I know, but if I hadn’t caught the semicolon during
verification and removed it, it would have led to some
big changes in my results. That, of course, could have led to different business decisions. That’s an example of why
verification is so crucial. But that’s not all. The other big part of the verification process is
reporting on your efforts. Open communication is a lifeline for any data analytics project. Reports are a super effective way to show your team that you’re being 100 percent transparent
about your data cleaning. Reporting is also a
great opportunity to show stakeholders that
you’re accountable, build trust with your team, and make sure you’re
all on the same page of important project details. Coming up, you’ll learn different strategies
for reporting, like creating data-
cleaning reports, documenting your
cleaning process, and using something
called the changelog. A changelog is a file containing a chronologically ordered list of modifications made to a project. It’s usually organized
by version and includes the date followed
by a list of added, improved, and removed features. Changelogs are very useful
for keeping track of how a dataset evolved over
the course of a project. They’re also another great way to communicate and report
on data to others. Along the way, you’ll also see some examples of how
verification and reporting can help
you avoid repeating mistakes and save you
and your team time. Ready to get started? Let’s go!
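
For illustration only, a changelog entry following the structure just described might look something like this; the version number, date, and items are hypothetical:

  Version 1.1.0 (2024-06-03)
    Added: derived column region_code for each store
    Improved: standardized supplier names to one spelling per supplier
    Removed: 12 duplicate membership rows found during verification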

Video: Cleaning and your data expectations

This video highlights the importance of verifying your data cleaning efforts as a crucial step for reliable data analysis and decision-making. It explains what verification is and outlines the key steps involved:

1. Compare Clean vs. Unclean Data:

  • Review the original dirty data and search for issues addressed during cleaning (e.g., null values, typos).
  • Use tools like conditional formatting or filters to confirm the absence of these issues in the cleaned data.

2. Take a Big-Picture View:

  • Re-focus on the original business problem and project goals.
  • Ensure the data is relevant and capable of solving the problem and achieving those goals.
  • Consider:
    • Business Problem: What are you trying to solve with the data?
    • Project Goal: What do you want to achieve with the analysis?
    • Data Capability: Can the data truly help you solve the problem and reach the goal?

3. Check for Suspicious or Problematic Data:

  • Get feedback from others to uncover potential issues you might miss due to familiarity.
  • Look for discrepancies or inconsistencies that appear illogical.
    • Example: Finding more survey responses than the number of surveys sent out indicates duplication or a data-cleaning error.

By verifying your data, you build trust in your analysis and protect your company from making crucial decisions based on unreliable information.
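
As a rough SQL sketch of those checks (the customer_feedback table, its columns, and the misspelling are hypothetical), you could confirm that the issues you fixed are really gone:

  -- Confirm that null values handled during cleaning no longer appear.
  SELECT COUNT(*) AS remaining_nulls
  FROM customer_feedback
  WHERE response IS NULL;

  -- The FIND equivalent: confirm a known misspelling no longer occurs.
  SELECT COUNT(*) AS misspelling_count
  FROM customer_feedback
  WHERE product_name LIKE '%Plos%';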

In this video, we’ll discuss how to begin the process
of verifying your data-cleaning efforts. Verification is a critical
part of any analysis project. Without it, you have no way of knowing
that your insights can be relied on for data-driven decision-making. Think of verification
as a stamp of approval. To refresh your memory, verification is
a process to confirm that a data-cleaning effort was well-executed and the resulting
data is accurate and reliable. It also involves manually cleaning data
to compare your expectations with what’s actually present. The first step in the verification
process is going back to your original unclean data set and
comparing it to what you have now. Review the dirty data and
try to identify any common problems. For example, maybe you had a lot of nulls. In that case, you check your clean
data to ensure no nulls are present. To do that, you could search
through the data manually or use tools like conditional formatting or
filters. Or maybe there was a common misspelling
like someone keying in the name of a product incorrectly over and over again. In that case, you’d run a FIND in your
clean data to make sure no instances of the misspelled word occur. Another key part of verification involves
taking a big-picture view of your project. This is an opportunity to confirm you’re
actually focusing on the business problem that you need to solve and
the overall project goals and to make sure that your data is
actually capable of solving that problem and achieving those goals. It’s important to take the time to
reset and focus on the big picture because projects can sometimes evolve or transform over time without
us even realizing it. Maybe an e-commerce company decides
to survey 1000 customers to get information that would be
used to improve a product. But as responses begin coming in, the
analysts notice a lot of comments about how unhappy customers are with the
e-commerce website platform altogether. So the analysts start to focus on that. While the customer buying experience
is of course important for any e-commerce business, it wasn’t
the original objective of the project. The analysts in this case
need to take a moment to pause, refocus, and get back to solving
the original problem. Taking a big picture view of your
project involves doing three things. First, consider the business problem
you’re trying to solve with the data. If you’ve lost sight of the problem, you have no way of knowing what
data belongs in your analysis. Taking a problem-first approach to
analytics is essential at all stages of any project. You need to be certain that your data will
actually make it possible to solve your business problem. Second, you need to consider
the goal of the project. It’s not enough just to know that your
company wants to analyze customer feedback about a product. What you really need to know is that the
goal of getting this feedback is to make improvements to that product. On top of that, you also need to know
whether the data you’ve collected and cleaned will actually help your
company achieve that goal. And third, you need to consider whether
your data is capable of solving the problem and
meeting the project objectives. That means thinking about
where the data came from and testing your data collection and
cleaning processes. Sometimes data analysts can be
too familiar with their own data, which makes it easier to miss something or
make assumptions. Asking a teammate to review your
data from a fresh perspective and getting feedback from others is
very valuable in this stage. This is also the time to notice
if anything sticks out to you as suspicious or
potentially problematic in your data. Again, step back,
take a big picture view, and ask yourself, do the numbers make sense? Let’s go back to our
e-commerce company example. Imagine an analyst is
reviewing the cleaned up data from the customer satisfaction survey. The survey was originally sent
to 1,000 customers, but what if the analyst discovers that there are more
than a thousand responses in the data? This could mean that one customer figured
out a way to take the survey more than once. Or it could also mean that something went
wrong in the data cleaning process, and a field was duplicated. Either way, this is a signal that it’s
time to go back to the data-cleaning process and correct the problem. Verifying your data ensures that
the insights you gain from analysis can be trusted. It’s an essential part of data-cleaning
that helps companies avoid big mistakes. This is another place where
data analysts can save the day. Coming up, we’ll go through the next steps in
the data-cleaning process. See you there.
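
Before moving on, here is a minimal sketch of the response-count check from the survey example above, assuming the answers live in a table called survey_responses with a customer_id column (both names are hypothetical):

  -- A survey sent to 1,000 customers should return at most 1,000 responses.
  SELECT COUNT(*) AS total_responses
  FROM survey_responses;

  -- List any customer who appears more than once, which points to repeat
  -- submissions or a duplication introduced during cleaning.
  SELECT customer_id, COUNT(*) AS times_answered
  FROM survey_responses
  GROUP BY customer_id
  HAVING COUNT(*) > 1;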

Video: The final step in data cleaning

Summary of Verification in Data Cleaning:

Focus: Ensure data cleaning was thorough and results are reliable.

Previous Steps:

  • Compared cleaned vs. unclean data.
  • Manually fixed common errors like extra spaces.

Current Step: Handle more complex errors with tools and functions.

Tools covered:

  • TRIM: Removes leading, trailing, and repeated spaces.
  • Remove duplicates: Eliminates duplicate entries in spreadsheets.
  • Find and replace: Locates and fixes specific data.
  • Pivot table: Sorts, reorganizes, and summarizes data to detect irregularities.
  • COUNTA function: Counts total values in a range, including text entries.

Example: Misspelled supplier name (“P-L-O-S” instead of “PLUS”).

Solutions:

  • Find and replace: Correct all instances of the misspelling (with caution).
  • Pivot table: Count occurrences of supplier names to check for repeated errors.

Additional notes:

  • SQL CASE statement: Fixes misspellings in queries based on conditions.
  • Importance of tracking changes: Recording modifications for transparency and reproducibility.

In conclusion, this video explored advanced techniques for verifying data cleaning, using tools like pivot tables and SQL functions to address complex errors and ensuring reliable data analysis.
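
If the same party-supply data lived in a database instead of a spreadsheet, the pivot-table check could be expressed as a simple aggregation; this is only a sketch, and the table and column names are hypothetical:

  -- Count how many rows each supplier name appears on.
  -- With four legitimate suppliers, a misspelling such as 'Plos'
  -- shows up as an unexpected fifth group with a small count.
  SELECT supplier, COUNT(*) AS row_count
  FROM party_supplies
  GROUP BY supplier
  ORDER BY row_count DESC;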

Hey there. In this video, we’ll continue building on
the verification process. As a quick reminder, the goal is to ensure that our data-cleaning work was done properly and the results
can be counted on. You want your data to be
verified so you know it’s 100 percent ready to go. It’s like car companies
running tons of tests to make sure a car is safe
before it hits the road. You learned that
the first step in verification is returning to your original, unclean dataset and comparing it to
what you have now. This is an opportunity to
search for common problems. After that, you clean up the problems manually.
For example, by eliminating extra spaces or removing an unwanted
quotation mark. But there’s also
some great tools for fixing common errors
automatically, such as TRIM and
remove duplicates. Earlier, you learned that TRIM is a function that
removes leading, trailing, and repeated
spaces in data. Remove duplicates is a tool
that automatically searches for and eliminates duplicate
entries from a spreadsheet. Now sometimes you
have an error that shows up repeatedly,
and it can’t be resolved with a
quick manual edit or a tool that fixes the
problem automatically. In these cases, it’s helpful
to create a pivot table. A pivot table is a data summarization tool that is used in data processing. Pivot tables sort,
reorganize, group, count, total or average
data stored in a database. We’ll practice that now using the spreadsheet from
a party supply store. Let’s say this company was interested in learning which of its four suppliers is
most cost-effective. An analyst pulled this data on the products the business sells, how many were purchased, which supplier provides them, the cost of the products,
and the ultimate revenue. The data has been cleaned. But during verification,
we noticed that one of the suppliers’ names was
keyed in incorrectly. We could just correct
the word as “plus,” but this might not
solve the problem because we don’t know if this was a one-time occurrence or if the problem’s repeated
throughout the spreadsheet. There are two ways to
answer that question. The first is using
Find and replace. Find and replace is a
tool that looks for a specified search term in a spreadsheet and allows you to replace it
with something else. We’ll choose Edit.
Then Find and replace. We’re trying to find P-L-O-S, the misspelling of “plus”
in the supplier’s name. In some cases you might not
want to replace the data. You just want to find
something. No problem. Just type the search term, leave the rest of the options as default and click “Done.” But right now we do
want to replace it with P-L-U-S. We’ll type that in here. Then click “Replace all” and “Done.” There we go. Our misspelling
has been corrected. That was of course the goal. But for now let’s
undo our Find and replace so we can practice another
way to determine if errors are repeated
throughout a dataset, like with the pivot table. We’ll begin by selecting
the data we want to use. Choose column C. Select
“Data.” Then “Pivot Table.” Choose “New Sheet” and “Create.” We know this company
has four suppliers. If we count the suppliers and the number doesn’t equal four, we know there’s a problem. First, add a row for suppliers. Next, we’ll add a value for our suppliers and
summarize by COUNTA. COUNTA counts the total number of values
within a specified range. Here we’re counting
the number of times a supplier’s name appears in column C. Note that there’s
also function called COUNT, which only counts
the numerical values within a specified range. If we use it here, the result would be zero. Not what we have in mind. But in other special
applications, COUNT would give us information we want for our current example. As you continue learning more about formulas and functions, you’ll discover more
interesting options. If you want to keep learning, search online for spreadsheet
formulas and functions. There’s a lot of great
information out there. Our pivot table has counted
the number of misspellings, and it clearly shows that
the error occurs just once. Otherwise our four suppliers are accurately accounted
for in our data. Now we can correct
the spelling, and we verify that the rest of the
supplier data is clean. This is also useful practice
when querying a database. If you’re working in SQL, you can address misspellings
using a CASE statement. The CASE statement goes through one or more conditions and returns a value as soon
as a condition is met. Let’s discuss how this
works in real life using our customer_name
table. Check out how our customer, Tony Magnolia, shows
up as Tony and Tnoy. Tony’s name was misspelled. Let’s say we want a list
of our customer IDs and the customer’s first
names so we can write personalized notes thanking each customer for their purchase. We don’t want Tony’s note to be addressed incorrectly
to “Tnoy.” Here’s where we can use
the CASE statement. We’ll start our query with
the basic SQL structure. SELECT, FROM, and WHERE. We know that data comes from the customer_name table in the customer_data dataset, so we can add customer underscore data dot customer underscore name
after FROM. Next, we tell SQL what data
to pull in the SELECT clause. We want customer_id
and first_name. We can go ahead and add customer underscore ID after SELECT. But for our customer’s
first names, we know that Tony was misspelled, so we’ll correct that
using CASE. We’ll add CASE and then WHEN and type first
underscore name equal “Tnoy.” Next we’ll use the THEN
command and type “Tony,” followed by the ELSE command. Here we will type
first underscore name, followed by End As and then we’ll type
cleaned underscore name. Finally, we’re not
filtering our data, so we can eliminate
the WHERE clause. As I mentioned, a CASE statement can cover multiple cases. If we wanted to search for a
few more misspelled names, our statement would look
similar to the original, with some additional
names like this. There you go. Now that you’ve
learned how you can use spreadsheets and SQL to
fix errors automatically, we’ll explore how to keep
track of our changes next.
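
Putting the steps described above together, the query looks roughly like this; the second WHEN clause is a hypothetical extra correction, included only to show how one CASE statement can cover several misspellings:

  SELECT
    customer_id,
    CASE
      WHEN first_name = 'Tnoy' THEN 'Tony'
      WHEN first_name = 'Tonny' THEN 'Tony'  -- hypothetical additional misspelling
      ELSE first_name
    END AS cleaned_name
  FROM customer_data.customer_name;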

Reading: Data-cleaning verification: A checklist


Practice Quiz: Test your knowledge on manual data cleaning

Making sure data is properly verified is an important part of the data-cleaning process. Which of the following tasks are involved in this verification? Select all that apply.

Fill in the blank: To count the total number of spreadsheet values within a specified range, a data analyst uses the _____ function.

A data analyst is cleaning a dataset with inconsistent formats and repeated cases. They use the TRIM function to remove extra spaces from string variables. What other tools can they use for data cleaning? Select all that apply.

To correct a typo in a database column, where should you insert a CASE statement in a query?

Documenting results and the cleaning process


Video: Capturing cleaning changes

Summary of Data Cleaning Documentation:

Why Document?

  • Prevents data-cleaning errors from reappearing.
  • Informs other analysts about changes made.
  • Helps assess data quality for analysis.

Documentation methods:

  1. Changelogs:
    • Spreadsheets: Use version history and Show Edit History features.
    • SQL:
      • Company software.
      • Commit queries with explanations.
      • Add comments while cleaning.
      • Review query history.
  2. Reporting: Share documentation for real-time updates and stakeholder communication.

Benefits:

  • Improves workflow efficiency.
  • Enables collaboration and knowledge sharing.
  • Enhances data reliability and analysis accuracy.

Next: Learn about easy ways to share documentation and impress stakeholders.

Tutorial: Data Cleaning Documentation

Why Document Your Cleaning Process?

  • Reproducibility: Ensure you and others can replicate the steps taken to clean the data, even months or years later. This is crucial for validating results, addressing errors, and updating analyses.
  • Transparency: Facilitate collaboration and understanding among team members, allowing others to track changes and make informed decisions about the data.
  • Quality Assessment: Provide a record of cleaning decisions and rationale, enabling you to assess the quality of the data and its suitability for analysis.
  • Error Correction: Aid in identifying and fixing errors that may have been introduced during cleaning, preventing inaccurate analyses.

Key Documentation Methods:

  1. Changelogs:
    • Create a chronological record of modifications made to the data.
    • Include details such as date, time, changes made, and who made them.
    • Tools:
      • Spreadsheets: Utilize version history and edit history features.
      • SQL: Use query history, commit queries with explanations, or add comments within code.
      • Consider dedicated software or version control systems (e.g., Git) for complex projects.
  2. Code Comments:
    • Write clear and concise comments within your cleaning code to explain the logic and reasoning behind each step.
    • This enhances readability and understanding for future users.
  3. Readme Files:
    • Create a comprehensive overview of the cleaning process, including:
      • Data sources and format
      • Cleaning steps and rationale
      • Known issues or limitations
      • Assumptions made
      • Instructions for reproducing the cleaning process
  4. Data Dictionaries:
    • Define variables, their meaning, data types, and any transformations applied.
    • Promote consistency in interpretation and usage.
  5. Visualizations:
    • Use charts, graphs, or diagrams to illustrate the cleaning process and the effects of cleaning decisions.
    • Enhance understanding and communication of data quality issues.

Best Practices:

  • Start Early and Update Regularly: Document as you clean, not as an afterthought.
  • Be Clear and Concise: Use language that is easy to understand and avoid technical jargon.
  • Focus on Key Decisions and Rationale: Explain why certain choices were made.
  • Version Control: Track changes and revert to previous versions if needed.
  • Share and Collaborate: Make documentation accessible to team members and stakeholders.
  • Tailor to Audience: Adjust the level of detail based on the intended audience.

Remember: Data cleaning documentation is an investment that pays off in terms of reproducibility, transparency, and trust in your data analysis.

Hi again. Now that you’ve learned how to make your
data squeaky clean, it’s time to address all the
dirt you’ve left behind. When you clean your data, all the incorrect or
outdated information is gone, leaving you with the
highest-quality content. But all those changes you made to the data are valuable too. In this video, we’ll discuss why keeping track of changes is important to every data
project and how to document all your
cleaning changes to make sure everyone
stays informed. This involves documentation which is the process of
tracking changes, additions, deletions and errors involved in your data
cleaning effort. You can think of it
like a crime TV show. Crime evidence is found at the scene and passed on
to the forensics team. They analyze every inch of the scene and
document every step, so they can tell a story
with the evidence. A lot of times, the forensic scientist is called to court to testify about that evidence, and they have a detailed report to refer to. The same thing applies
to data cleaning. Data errors are the crime, data cleaning is
gathering evidence, and documentation is detailing exactly what happened for
peer review or court. Having a record of how
a data set evolved does three very
important things. First, it lets us recover
data-cleaning errors. Instead of scratching our heads, trying to remember what we might have done three months ago, we have a cheat sheet
to rely on if we come across the same
errors again later. It’s also a good idea to create a clean table rather than overriding your
existing table. This way, you still
have the original data in case you need to
redo the cleaning. Second, documentation
gives you a way to inform other users of
changes you’ve made. If you ever go on
vacation or get promoted, the analyst who
takes over for you will have a reference
sheet to check in with. Third, documentation helps
you to determine the quality of the data to
be used in analysis. The first two benefits assume
the errors aren’t fixable. But if they are, a record gives the data engineer more
information to refer to. It’s also a great warning for
ourselves that the data set is full of errors and should
be avoided in the future. If the errors were
time-consuming to fix, it might be better to check out alternative data sets
that we can use instead. Data analysts usually use a changelog to access
this information. As a reminder, a changelog
is a file containing a chronologically ordered list of modifications made to a project. You can use and
view a changelog in spreadsheets and SQL to
achieve similar results. Let’s start with the spreadsheet. We can use Sheet’s version history, which provides a
real-time tracker of all the changes and who made them from individual cells to
the entire worksheet. To find this feature, click the File tab, and then select Version history. In the right panel, choose an earlier version. We can find who
edited the file and the changes they made in the
column next to their name. To return to the current version, go to the top left
and click “Back.” If you want to check out
changes in a specific cell, we can right-click and
select Show Edit History. Also, if you want others to be able to browse a
sheet’s version history, you’ll need to assign permission. Now let’s switch gears
and talk about SQL. The way you create and
view a changelog with SQL depends on the software
program you’re using. Some companies even have
their own separate software that keeps track of changelogs
and important SQL queries. This gets pretty advanced. Essentially, all
you have to do is specify exactly what you did and why when you commit a query to the repository as a new
and improved query. This allows the company to revert back to a previous version if something you’ve done
crashes the system, which has happened to me before. Another option is to just add comments as you go while
you’re cleaning data in SQL. This will help you construct your changelog after the fact. For now, we’ll check out query history, which tracks
all the queries you’ve run. You can click on any of
them to revert back to a previous version
of your query or to bring up an older version
to find what you’ve changed. Here’s what we’ve got. I’m in the Query history tab. Listed on the bottom right are all the queries that
run by date and time. You can click on this
icon to the right of each individual query to bring
it up to the Query editor. Changelogs like these are a great way to keep
yourself on track. It also lets your team get real-time updates
when they want them. But there’s another way
to keep the communication flowing, and that’s reporting. Stick around, and you’ll learn
some easy ways to share your documentation and maybe impress your stakeholders
in the process. See you in the next video.
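
As a sketch of the “comment as you go” idea mentioned above, the comments below double as raw material for a later changelog entry; the table and column names are hypothetical:

  -- 2024-06-03: Cleaning pass on the raw sales import.
  -- Wrote results to a new table instead of overwriting the original,
  -- so the raw data stays available if the cleaning needs to be redone.
  CREATE TABLE sales_clean AS
  SELECT
    product_name,
    TRIM(supplier) AS supplier,  -- removed stray leading/trailing spaces
    unit_price
  FROM sales_raw;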

Reading: Embrace changelogs


Practice Quiz: Self-Reflection: Creating a changelog

Video: Why documentation is important

Summary of Data Cleaning Reporting:

Data analysis, like a crime drama:

  • Data is the evidence.
  • Data cleaning is the forensic investigation.
  • Reporting is presenting findings in court.

Documentation as evidence log:

  • Tracks changes, deletions, and errors in data cleaning.
  • Serves as a reference for future analysts and stakeholders.
  • Examples: changelogs, comments in code.

Reporting the case to stakeholders:

  • Explains steps taken to clean the data.
  • Quantifies the impact of changes (e.g., removing duplicates).
  • Uses clear and concise language.

Benefits of transparency:

  • Ensures everyone is informed about data quality.
  • Builds trust and credibility with stakeholders.
  • Facilitates collaboration and knowledge sharing.

Next: Learn advanced reporting techniques using SQL comments and other methods.

Tutorial: Data Cleaning Reporting

Presenting Your Clean-Up: A Data Analyst’s Testimony

Data cleaning isn’t just about scrubbing the grime from your data; it’s about presenting your discoveries and justifying your decisions. Just like a forensic scientist testifies on the stand, data analysts become storytellers, revealing the hidden truths they’ve unearthed in the data. This tutorial equips you with the tools to build a compelling narrative around your data cleaning efforts.

Why Report?

  • Transparency: Share your cleaning decisions with stakeholders and collaborators.
  • Reproducibility: Document your steps for future reference and validation.
  • Trustworthiness: Showcase your meticulousness and build confidence in your analysis.
  • Knowledge Sharing: Educate others about the data and potential errors.

Reporting Styles:

  1. Textual Reports:
    • List cleaning steps chronologically with clear explanations.
    • Highlight major changes and their impact on the data.
    • Include tables, charts, and screenshots for visual clarity.
  2. Interactive Notebooks:
    • Embed code with explanations to show transformations in action.
    • Showcase visualized data before and after cleaning.
    • Allow readers to explore and interact with the data analysis.
  3. Presentations:
    • Condense key findings into engaging visuals and concise points.
    • Explain cleaning decisions in a story-like manner.
    • Tailor your message to the audience’s level of technical expertise.

Tips for Effective Reporting:

  • Start Early: Document your process as you clean, not as an afterthought.
  • Focus on Impact: Explain how your changes improve data quality and analysis.
  • Quantify Changes: Use metrics to show the effects of your cleaning (e.g., error reduction, duplicate removal).
  • Use Clear Language: Avoid technical jargon and explain complex concepts in simple terms.
  • Structure Your Report: Make it easy to navigate with logical sections and headings.
  • Proofread and Edit: Ensure accuracy and readability before sharing your report.

Advanced Techniques:

  • SQL Comments: Embed explanations within your cleaning code for reference.
  • Version Control: Track changes and revert to previous versions if needed.
  • Interactive Dashboards: Build dynamic visualizations that update with the data.

Remember: Your data cleaning report is your chance to shine as a data detective. Showcase your meticulous work, explain your discoveries in a compelling way, and build trust in your data-driven conclusions.

Ready to present your case? Go forth and clean, document, and share!

Great, you’re back. Let’s set the stage. The crime is dirty data. We’ve gathered the evidence. It’s been cleaned, verified,
and cleaned again. Now it’s time to
present our evidence. We’ll retrace the steps and present our case to our peers. As we discussed
earlier, data cleaning, verifying, and reporting
is a lot like crime drama. Now it’s our day in court. Just like a forensic scientist testifies on the stand
about the evidence, data analysts are
counted on to present their findings after a
data cleaning effort. Earlier, we learned
how to document and track every step of the
data cleaning process, which means we have solid
information to pull from. As a quick refresher, documentation is the process of tracking changes, additions, deletions, and errors involved
in a data cleaning effort, changelogs are good
example of this. Since it’s staged
chronologically, it provides a real-time
account of every modification. Documenting will be
a huge time saver for you as a future data analyst. It’s basically a cheatsheet you can refer to if
you’re working with the similar data set or need
to address similar errors. While your team can view
changelogs directly, stakeholders can’t and have to rely on your report
to know what you did. Lets check out how
we might document our data cleaning process using example we
worked with earlier. In that example, we found
that this association had two instances of
the same membership for $500 in its database. We decided to fix this manually by deleting
the duplicate info. There’re plenty of
ways we could go about documenting what we did. One common way is to
just create a doc listing out the steps we took
and the impact they had. For example, first on your list would be that you remove the duplicate instance, which decreased the number
of rows from 33 to 32, and lowered the
membership total by $500. If we were working with SQL, we could include a comment in the statement describing
the reason for a change without affecting the execution of the statement. That’s something a
bit more advanced, which we’ll talk about later. Regardless of how we capture
and share our changelogs, we’re setting ourselves
up for success by being 100 percent transparent
about our data cleaning. This keeps everyone on
the same page and shows project stakeholders that we are accountable for
effective processes. In other words, this helps build our credibility
as witnesses who can be trusted to present all the evidence accurately
during testimony. For dirty data, it’s
an open and shut case.

Video: Feedback and cleaning

Summary of Data Cleaning Feedback and Application:

Beyond Validation:

  • Data cleaning provides insights beyond just data accuracy.
  • Error patterns revealed during cleaning can be used to improve data collection processes.

Common Data Errors:

  • Human mistakes (typos, misspellings)
  • Flawed processes (survey design)
  • System issues (integration errors)

Data Cleaning as Feedback Loop:

  • Documentation and reporting enable error pattern identification.
  • Feedback used to adjust data collection procedures and quality control.
  • Examples: reprogramming data collection, updating survey questions, data engineer meetings.

Benefits of Error Reduction:

  • Increased trust in data for decision-making.
  • Improved data collection efficiency.
  • Potential for increased revenue.

Next Steps:

  • Continue building data validation skills.
  • Apply these lessons to future data cleaning projects.

Tutorial: Data Cleaning Feedback and Application

Beyond the Report: Transforming Data Insights into Action

Congratulations! You’ve meticulously cleaned, documented, and reported on your data. But the journey doesn’t end there. The real treasure lies in leveraging the feedback from your cleaning process to optimize future data collection and analysis. This tutorial will guide you through how to transform data cleaning insights into actionable improvements.

From Glitches to Growth:

Data cleaning isn’t just about removing dirt; it’s about uncovering gems. The errors and inconsistencies you encounter reveal valuable information about your data collection and processing systems. By analyzing these “glitches,” you can identify opportunities for growth and improvement.

Identifying Error Patterns:

  • Human Errors: Look for patterns in typos, misspellings, and inconsistent formatting. This might indicate poorly designed data entry interfaces or training needs for data collectors.
  • Process Errors: Analyze error frequency in specific fields or survey questions. This could expose flaws in survey design, data validation rules, or data transfer procedures.
  • System Errors: Track inconsistencies arising from data integration or software issues. This often requires collaboration with data engineers to diagnose and fix underlying system problems.
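
One way to quantify the patterns listed above, assuming you keep a simple log of every correction made during cleaning (the cleaning_log table and its columns are hypothetical):

  -- Tally corrections by field and error type to see where problems cluster;
  -- the largest groups point at the collection step most worth fixing.
  SELECT field_name, error_type, COUNT(*) AS occurrences
  FROM cleaning_log
  GROUP BY field_name, error_type
  ORDER BY occurrences DESC;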

Turning Feedback into Action:

  • Data Collection Optimization: Use your findings to redesign data entry forms, improve user interfaces, or implement stricter validation rules.
  • Survey Reform: Identify poorly worded questions, ambiguous choices, or missing answer options. Revise your survey instrument based on these insights.
  • Quality Control Enhancement: Schedule regular data audits, implement automated error detection algorithms, or establish data cleaning protocols for specific data sources.
  • System Upgrades: Collaborate with data engineers to address integration issues, fix buggy software, or implement data cleansing algorithms at the source.

Communication is Key:

Share your feedback and proposed improvements with stakeholders, data owners, and data engineering teams. Clear communication and collaboration are crucial for implementing effective changes.

The Ripple Effect of Better Data:

By refining your data collection and processing systems, you’ll reap the rewards of cleaner, more reliable data. This, in turn, leads to:

  • Accurate Analyses: More trustworthy data fuels confident decision-making and impactful business strategies.
  • Efficient Operations: Reduced errors and improved data quality streamline workflows and save time and resources.
  • Increased Revenue: Insightful data analysis can uncover new opportunities for growth, market expansion, and cost optimization.

Remember: Data cleaning is not just a chore; it’s a powerful tool for continuous improvement. Embrace the feedback it offers, refine your data collection systems, and empower your organization to make data-driven decisions that lead to success.

Start applying these techniques to your next data cleaning project and watch your data transform from messy to marvelous!

Welcome back. By now it’s safe to say that verifying,
documenting and reporting are valuable steps
in the data-cleaning process. You have proof to give stakeholders
that your data is accurate and reliable. And the effort to attain it was
well-executed and documented. The next step is getting feedback about
the evidence and using it for good, which we’ll cover in this video. Clean data is important
to the task at hand. But the data-cleaning process itself
can reveal insights that are helpful to a business. The feedback we get when we report on our
cleaning can transform data collection processes, and
ultimately business development. For example, one of the biggest challenges of working
with data is dealing with errors. Some of the most common errors involve
human mistakes like mistyping or misspelling, flawed processes like poor
design of a survey form, and system issues where older
systems integrate data incorrectly. Whatever the reason, data-cleaning
can shine a light on the nature and severity of error-generating processes. With consistent documentation and
reporting, we can uncover error patterns in data
collection and entry procedures and use the feedback we get to make
sure common errors aren’t repeated. Maybe we need to reprogram
the way the data is collected or change specific questions
on the survey form. In more extreme cases, the feedback we get can even send
us back to the drawing board to rethink expectations and possibly
update quality control procedures. For example, sometimes it’s useful to
schedule a meeting with a data engineer or data owner to make sure
the data is brought in properly and doesn’t require constant cleaning. Once errors have been identified and
addressed, stakeholders have data they can trust for
decision-making. And by reducing errors and
inefficiencies in data collection, the company just might discover
big increases to its bottom line. Congratulations! You now have the foundation you need
to successfully verify and report on your cleaning results. Stay tuned to keep building
on your new skills.

Reading: Advanced functions for speedy data cleaning


Practice Quiz: Test your knowledge on documenting the cleaning process

Why is it important for a data analyst to document the evolution of a dataset? Select all that apply.

Fill in the blank: While cleaning data, documentation is used to track _____. Select all that apply.

Documenting data-cleaning makes it possible to achieve what goals? Select all that apply.

Module 4 challenge


Reading: Glossary: Terms and definitions

Quiz: Module 4 challenge

What is involved in seeing the big picture when verifying data cleaning? Select all that apply.

Fill in the blank: TRIM is a function that removes _____ spaces in data. Select all that apply.

A data analyst uses the COUNTA function to count which of the following?

Fill in the blank: A data analyst uses the CASE statement to consider one or more _____, then returns a value.

Fill in the blank: Documentation is the process of tracking _____ during data cleaning. Select all that apply.

At what point during the analysis process does a data analyst use a changelog?

A data analyst commits a query to the repository as a new and improved query. Then, they specify the changes they made and why they made them. This scenario is part of what process?