
Module 4: Verify and report on your cleaning results

Cleaning your data is an essential step in the data analysis process. Verifying and reporting on your cleaning is how you show that your data is ready for the next step. In this part of the course, you’ll learn about the processes involved in verifying and reporting on data cleaning, as well as their benefits.

Learning Objectives

  • Describe the process involved in verifying the results of cleaning data
  • Describe what is involved in manually cleaning data
  • Discuss the elements and importance of data-cleaning reports
  • Describe the benefits of documenting the data-cleaning process

Manually cleaning data


Video: Verifying and reporting results

Verification and reporting are crucial steps after data cleaning.

Verification:

  • Ensures data cleaning was thorough and accurate.
  • Involves rechecking the dataset, manual cleanup, and reflecting on the project’s goal.
  • Catches errors like typos or incorrect references before analysis.
  • Prevents unreliable insights, misrepresented populations, and wasted effort.
  • Example: A forgotten semicolon could have significantly altered results and business decisions.

Reporting:

  • Improves transparency and accountability.
  • Builds trust with teammates and stakeholders.
  • Aligns everyone on project details.
  • Strategies: data-cleaning reports, process documentation, change logs.
  • Change logs track dataset evolution and facilitate communication.
  • Both verification and reporting help avoid mistakes and save time.

Overall message: Don’t skip verification and reporting. They are essential for reliable data analysis and successful projects.

Tutorial: Verification and Reporting – Securing the Cleanliness of Your Data

Introduction:

Cleaning your data is vital for accurate analysis and insightful conclusions. But the journey doesn’t end there! Verification and reporting are crucial next steps to ensure your data’s integrity and communicate its journey effectively. This tutorial will guide you through these essential processes, equipping you to confidently present clean and reliable data.

Verification:

  1. Revisit the Purpose:

Before diving in, remind yourself of the project’s original goals. Are you sure the cleaned data aligns with those objectives? This can guide your verification efforts.

  2. Double-check the Cleanliness:

Run through your cleaning steps again, meticulously revisiting every transformation and correction. Tools like statistical checks and visualizations can assist you.

  3. Manual Scrutiny:

Don’t rely solely on automation. Sample subsets of data and manually inspect them for lingering errors like typos, inconsistencies, or formatting issues.

  4. Contextual Validation:

Think beyond individual data points. Does the overall distribution and relationships between variables make sense? Do outliers require further investigation?

  5. Compare to Originals:

Contrast your cleaned data with the raw dataset. Can you explain the changes you made and justify their impact on the data’s integrity?
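
To make the comparison step concrete, here is a minimal SQL sketch, assuming a raw import table and a cleaned copy; the table names raw_sales and clean_sales are hypothetical:

  -- Compare row counts between the raw import and the cleaned copy
  -- so you can state exactly how many records cleaning removed.
  SELECT
    (SELECT COUNT(*) FROM raw_sales)   AS raw_row_count,
    (SELECT COUNT(*) FROM clean_sales) AS clean_row_count;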

Reporting:

  1. Data Cleaning Report:

Document your cleaning process comprehensively. Outline the tools used, transformations applied, and decisions made. This transparency builds trust and ensures reproducibility.

  2. Changelog:

Maintain a chronological record of every data modification, including additions, corrections, and removals. This helps track the data’s evolution and facilitates collaboration.

  3. Visualizations:

Use compelling graphics like histograms and scatter plots to showcase the cleaned data’s distribution and relationships between variables. This adds clarity and persuasiveness to your report.

  4. Stakeholder Communication:

Tailor your reports and presentations to your audience. Ensure technical details are comprehensible for data experts while providing clear summaries for non-technical stakeholders.

  5. Lessons Learned:

Reflect on your experience. Identify areas for improvement in your cleaning methods and document valuable insights gained from the verification process.

Benefits of Verification and Reporting:

  • Trustworthy Data: Double-checking ensures your data is reliable for analysis and decision-making.
  • Reproducible Workflow: Detailed reports allow others to replicate your cleaning process and validate your results.
  • Efficient Collaboration: Clear communication fosters teamwork and prevents misunderstandings within your team.
  • Continuous Improvement: Learning from past experiences helps refine your data cleaning skills and avoid future mistakes.

Conclusion:

Verification and reporting are not just formalities; they are pillars of responsible data analysis. By implementing these crucial steps, you can confidently stand behind your clean data, fostering trust, collaboration, and reliable insights. Remember, clean data is not just about the present; it’s about laying a solid foundation for future success.

Bonus Tip: Consider using data management platforms that offer built-in features for data versioning, logging changes, and generating reports. This can streamline your verification and reporting processes and ensure your data’s integrity throughout its lifecycle.

I hope this tutorial provides a valuable roadmap for ensuring your data’s cleanliness and effectively communicating its journey. Remember, clean data is not just a goal; it’s a continuous process with verification and reporting playing key roles in its success.

Hi there, great to have you back. You’ve been learning a lot about the importance of clean data and explored some tools
and strategies to help you throughout
the cleaning process. In these videos, we’ll be covering the next
step in the process: verifying and reporting on the integrity of your clean data. Verification is a
process to confirm that a data cleaning effort was well-executed and the resulting
data is accurate and reliable. It involves rechecking
your clean dataset, doing some manual
clean ups if needed, and taking a moment to
sit back and really think about the original
purpose of the project. That way, you can be
confident that the data you collected is credible and
appropriate for your purposes. Making sure your data is properly verified is so
important because it allows you to double-check
that the work you did to clean up your data was
thorough and accurate. For example, you
might have referenced an incorrect cellphone number or accidentally keyed in a typo. Verification lets you catch mistakes before you
begin analysis. Without it, any
insights you gain from analysis can’t be trusted
for decision-making. You might even risk
misrepresenting populations or damaging the outcome
of a product that you’re actually
trying to improve. I remember working
on a project where I thought the data I
had was sparkling clean because I’d used all the
right tools and processes, but when I went through the steps to verify the data’s integrity, I discovered a semicolon that
I had forgotten to remove. Sounds like a really
tiny error, I know, but if I hadn’t caught the semicolon during
verification and removed it, it would have led to some
big changes in my results. That, of course, could have led to different business decisions. That’s an example of why
verification is so crucial. But that’s not all. The other big part of the verification process is
reporting on your efforts. Open communication is a lifeline for any data analytics project. Reports are a super effective way to show your team that you’re being 100 percent transparent
about your data cleaning. Reporting is also a
great opportunity to show stakeholders that
you’re accountable, build trust with your team, and make sure you’re
all on the same page of important project details. Coming up, you’ll learn different strategies
for reporting, like creating data-
cleaning reports, documenting your
cleaning process, and using something
called the changelog. A changelog is a file containing a chronologically ordered list of modifications made to a project. It’s usually organized
by version and includes the date followed
by a list of added, improved, and removed features. Changelogs are very useful
for keeping track of how a dataset evolved over
the course of a project. They’re also another great way to communicate and report
on data to others. Along the way, you’ll also see some examples of how
verification and reporting can help
you avoid repeating mistakes and save you
and your team time. Ready to get started? Let’s go!
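
For illustration only, a changelog entry following the structure just described might look something like this; the version number, date, and items are hypothetical:

  Version 1.1.0 (2024-06-03)
    Added: derived column region_code for each store
    Improved: standardized supplier names to one spelling per supplier
    Removed: 12 duplicate membership rows found during verification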

Video: Cleaning and your data expectations

This video highlights the importance of verifying your data cleaning efforts as a crucial step for reliable data analysis and decision-making. It explains what verification is and outlines the key steps involved:

1. Compare Clean vs. Unclean Data:

  • Review the original dirty data and search for issues addressed during cleaning (e.g., null values, typos).
  • Use tools like conditional formatting or filters to confirm the absence of these issues in the cleaned data.

2. Take a Big-Picture View:

  • Re-focus on the original business problem and project goals.
  • Ensure the data is relevant and capable of solving the problem and achieving those goals.
  • Consider:
    • Business Problem: What are you trying to solve with the data?
    • Project Goal: What do you want to achieve with the analysis?
    • Data Capability: Can the data truly help you solve the problem and reach the goal?

3. Check for Suspicious or Problematic Data:

  • Get feedback from others to uncover potential issues you might miss due to familiarity.
  • Look for discrepancies or inconsistencies that appear illogical.
    • Example: Finding more survey responses than the number of surveys sent out indicates duplication or a data-cleaning error.

By verifying your data, you build trust in your analysis and protect your company from making crucial decisions based on unreliable information.
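
As a rough SQL sketch of those checks (the customer_feedback table, its columns, and the misspelling are hypothetical), you could confirm that the issues you fixed are really gone:

  -- Confirm that null values handled during cleaning no longer appear.
  SELECT COUNT(*) AS remaining_nulls
  FROM customer_feedback
  WHERE response IS NULL;

  -- The FIND equivalent: confirm a known misspelling no longer occurs.
  SELECT COUNT(*) AS misspelling_count
  FROM customer_feedback
  WHERE product_name LIKE '%Plos%';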

In this video, we’ll discuss how to begin the process
of verifying your data-cleaning efforts. Verification is a critical
part of any analysis project. Without it, you have no way of knowing
that your insights can be relied on for data-driven decision-making. Think of verification
as a stamp of approval. To refresh your memory, verification is
a process to confirm that a data-cleaning effort was well-executed and the resulting
data is accurate and reliable. It also involves manually cleaning data
to compare your expectations with what’s actually present. The first step in the verification
process is going back to your original unclean data set and
comparing it to what you have now. Review the dirty data and
try to identify any common problems. For example, maybe you had a lot of nulls. In that case, you check your clean
data to ensure no nulls are present. To do that, you could search
through the data manually or use tools like conditional formatting or
filters. Or maybe there was a common misspelling
like someone keying in the name of a product incorrectly over and over again. In that case, you’d run a FIND in your
clean data to make sure no instances of the misspelled word occur. Another key part of verification involves
taking a big-picture view of your project. This is an opportunity to confirm you’re
actually focusing on the business problem that you need to solve and
the overall project goals and to make sure that your data is
actually capable of solving that problem and achieving those goals. It’s important to take the time to
reset and focus on the big picture because projects can sometimes evolve or transform over time without
us even realizing it. Maybe an e-commerce company decides
to survey 1000 customers to get information that would be
used to improve a product. But as responses begin coming in, the
analysts notice a lot of comments about how unhappy customers are with the
e-commerce website platform altogether. So the analysts start to focus on that. While the customer buying experience
is of course important for any e-commerce business, it wasn’t
the original objective of the project. The analysts in this case
need to take a moment to pause, refocus, and get back to solving
the original problem. Taking a big picture view of your
project involves doing three things. First, consider the business problem
you’re trying to solve with the data. If you’ve lost sight of the problem, you have no way of knowing what
data belongs in your analysis. Taking a problem-first approach to
analytics is essential at all stages of any project. You need to be certain that your data will
actually make it possible to solve your business problem. Second, you need to consider
the goal of the project. It’s not enough just to know that your
company wants to analyze customer feedback about a product. What you really need to know is that the
goal of getting this feedback is to make improvements to that product. On top of that, you also need to know
whether the data you’ve collected and cleaned will actually help your
company achieve that goal. And third, you need to consider whether
your data is capable of solving the problem and
meeting the project objectives. That means thinking about
where the data came from and testing your data collection and
cleaning processes. Sometimes data analysts can be
too familiar with their own data, which makes it easier to miss something or
make assumptions. Asking a teammate to review your
data from a fresh perspective and getting feedback from others is
very valuable in this stage. This is also the time to notice
if anything sticks out to you as suspicious or
potentially problematic in your data. Again, step back,
take a big picture view, and ask yourself, do the numbers make sense? Let’s go back to our
e-commerce company example. Imagine an analyst is
reviewing the cleaned up data from the customer satisfaction survey. The survey was originally sent
to 1,000 customers, but what if the analyst discovers that there are more
than a thousand responses in the data? This could mean that one customer figured
out a way to take the survey more than once. Or it could also mean that something went
wrong in the data cleaning process, and a field was duplicated. Either way, this is a signal that it’s
time to go back to the data-cleaning process and correct the problem. Verifying your data ensures that
the insights you gain from analysis can be trusted. It’s an essential part of data-cleaning
that helps companies avoid big mistakes. This is another place where
data analysts can save the day. Coming up, we’ll go through the next steps in
the data-cleaning process. See you there.
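
Before moving on, here is a minimal sketch of the response-count check from the survey example above, assuming the answers live in a table called survey_responses with a customer_id column (both names are hypothetical):

  -- A survey sent to 1,000 customers should return at most 1,000 responses.
  SELECT COUNT(*) AS total_responses
  FROM survey_responses;

  -- List any customer who appears more than once, which points to repeat
  -- submissions or a duplication introduced during cleaning.
  SELECT customer_id, COUNT(*) AS times_answered
  FROM survey_responses
  GROUP BY customer_id
  HAVING COUNT(*) > 1;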

Video: The final step in data cleaning

Summary of Verification in Data Cleaning:

Focus: Ensure data cleaning was thorough and results are reliable.

Previous Steps:

  • Compared cleaned vs. unclean data.
  • Manually fixed common errors like extra spaces.

Current Step: Handle more complex errors with tools and functions.

Tools covered:

  • TRIM: Removes leading, trailing, and repeated spaces.
  • Remove duplicates: Eliminates duplicate entries in spreadsheets.
  • Find and replace: Locates and fixes specific data.
  • Pivot table: Sorts, reorganizes, and summarizes data to detect irregularities.
  • COUNTA function: Counts total values in a range, including text entries.

Example: Misspelled supplier name (“P-L-O-S” instead of “PLUS”).

Solutions:

  • Find and replace: Correct all instances of the misspelling (with caution).
  • Pivot table: Count occurrences of supplier names to check for repeated errors.

Additional notes:

  • SQL CASE statement: Fixes misspellings in queries based on conditions.
  • Importance of tracking changes: Recording modifications for transparency and reproducibility.

In conclusion, this video explored advanced techniques for verifying data cleaning, using tools like pivot tables and SQL functions to address complex errors and ensuring reliable data analysis.
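
If the same party-supply data lived in a database instead of a spreadsheet, the pivot-table check could be expressed as a simple aggregation; this is only a sketch, and the table and column names are hypothetical:

  -- Count how many rows each supplier name appears on.
  -- With four legitimate suppliers, a misspelling such as 'Plos'
  -- shows up as an unexpected fifth group with a small count.
  SELECT supplier, COUNT(*) AS row_count
  FROM party_supplies
  GROUP BY supplier
  ORDER BY row_count DESC;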

Hey there. In this video, we’ll continue building on
the verification process. As a quick reminder, the goal is to ensure that our data-cleaning work was done properly and the results
can be counted on. You want your data to be
verified so you know it’s 100 percent ready to go. It’s like car companies
running tons of tests to make sure a car is safe
before it hits the road. You learned that
the first step in verification is returning to your original, unclean dataset and comparing it to
what you have now. This is an opportunity to
search for common problems. After that, you clean up the problems manually.
For example, by eliminating extra spaces or removing an unwanted
quotation mark. But there’s also
some great tools for fixing common errors
automatically, such as TRIM and
remove duplicates. Earlier, you learned that TRIM is a function that
removes leading, trailing, and repeated
spaces in data. Remove duplicates is a tool
that automatically searches for and eliminates duplicate
entries from a spreadsheet. Now sometimes you
have an error that shows up repeatedly,
and it can’t be resolved with a
quick manual edit or a tool that fixes the
problem automatically. In these cases, it’s helpful
to create a pivot table. A pivot table is a data summarization tool that is used in data processing. Pivot tables sort,
reorganize, group, count, total or average
data stored in a database. We’ll practice that now using the spreadsheet from
a party supply store. Let’s say this company was interested in learning which of its four suppliers is
most cost-effective. An analyst pulled this data on the products the business sells, how many were purchased, which supplier provides them, the cost of the products,
and the ultimate revenue. The data has been cleaned. But during verification,
we noticed that one of the suppliers’ names was
keyed in incorrectly. We could just correct
the word as “plus,” but this might not
solve the problem because we don’t know if this was a one-time occurrence or if the problem’s repeated
throughout the spreadsheet. There are two ways to
answer that question. The first is using
Find and replace. Find and replace is a
tool that looks for a specified search term in a spreadsheet and allows you to replace it
with something else. We’ll choose Edit.
Then Find and replace. We’re trying to find P-L-O-S, the misspelling of “plus”
in the supplier’s name. In some cases you might not
want to replace the data. You just want to find
something. No problem. Just type the search term, leave the rest of the options as default and click “Done.” But right now we do
want to replace it with P-L-U-S. We’ll type that in here. Then click “Replace all” and “Done.” There we go. Our misspelling
has been corrected. That was of course the goal. But for now let’s
undo our Find and replace so we can practice another
way to determine if errors are repeated
throughout a dataset, like with the pivot table. We’ll begin by selecting
the data we want to use. Choose column C. Select
“Data.” Then “Pivot Table.” Choose “New Sheet” and “Create.” We know this company
has four suppliers. If we count the suppliers and the number doesn’t equal four, we know there’s a problem. First, add a row for suppliers. Next, we’ll add a value for our suppliers and
summarize by COUNTA. COUNTA counts the total number of values
within a specified range. Here we’re counting
the number of times a supplier’s name appears in column C. Note that there’s
also function called COUNT, which only counts
the numerical values within a specified range. If we use it here, the result would be zero. Not what we have in mind. But in other special
applications, COUNT would give us information we want for our current example. As you continue learning more about formulas and functions, you’ll discover more
interesting options. If you want to keep learning, search online for spreadsheet
formulas and functions. There’s a lot of great
information out there. Our pivot table has counted
the number of misspellings, and it clearly shows that
the error occurs just once. Otherwise our four suppliers are accurately accounted
for in our data. Now we can correct
the spelling, and we verify that the rest of the
supplier data is clean. This is also useful practice
when querying a database. If you’re working in SQL, you can address misspellings
using a CASE statement. The CASE statement goes through one or more conditions and returns a value as soon
as a condition is met. Let’s discuss how this
works in real life using our customer_name
table. Check out how our customer, Tony Magnolia, shows
up as Tony and Tnoy. Tony’s name was misspelled. Let’s say we want a list
of our customer IDs and the customer’s first
names so we can write personalized notes thanking each customer for their purchase. We don’t want Tony’s note to be addressed incorrectly
to “Tnoy.” Here’s where we can use
the CASE statement. We’ll start our query with
the basic SQL structure. SELECT, FROM, and WHERE. We know that data comes from the customer_name table in the customer_data dataset, so we can add customer underscore data dot customer underscore name
after FROM. Next, we tell SQL what data
to pull in the SELECT clause. We want customer_id
and first_name. We can go ahead and add customer underscore ID after SELECT. But for our customer’s
first names, we know that Tony was misspelled, so we’ll correct that
using CASE. We’ll add CASE and then WHEN and type first
underscore name equal “Tnoy.” Next we’ll use the THEN
command and type “Tony,” followed by the ELSE command. Here we will type
first underscore name, followed by End As and then we’ll type
cleaned underscore name. Finally, we’re not
filtering our data, so we can eliminate
the WHERE clause. As I mentioned, a CASE statement can cover multiple cases. If we wanted to search for a
few more misspelled names, our statement would look
similar to the original, with some additional
names like this. There you go. Now that you’ve
learned how you can use spreadsheets and SQL to
fix errors automatically, we’ll explore how to keep
track of our changes next.
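
Putting the steps described above together, the query looks roughly like this; the second WHEN clause is a hypothetical extra correction, included only to show how one CASE statement can cover several misspellings:

  SELECT
    customer_id,
    CASE
      WHEN first_name = 'Tnoy' THEN 'Tony'
      WHEN first_name = 'Tonny' THEN 'Tony'  -- hypothetical additional misspelling
      ELSE first_name
    END AS cleaned_name
  FROM customer_data.customer_name;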

Reading: Data-cleaning verification: A checklist


Practice Quiz: Test your knowledge on manual data cleaning

Making sure data is properly verified is an important part of the data-cleaning process. Which of the following tasks are involved in this verification? Select all that apply.

Fill in the blank: To count the total number of spreadsheet values within a specified range, a data analyst uses the _____ function.

A data analyst is cleaning a dataset with inconsistent formats and repeated cases. They use the TRIM function to remove extra spaces from string variables. What other tools can they use for data cleaning? Select all that apply.

To correct a typo in a database column, where should you insert a CASE statement in a query?

Documenting results and the cleaning process


Video: Capturing cleaning changes

Summary of Data Cleaning Documentation:

Why Document?

  • Prevents data-cleaning errors from reappearing.
  • Informs other analysts about changes made.
  • Helps assess data quality for analysis.

Documentation methods:

  1. Changelogs:
    • Spreadsheets: Use version history and Show Edit History features.
    • SQL:
      • Company software.
      • Commit queries with explanations.
      • Add comments while cleaning.
      • Review query history.
  2. Reporting: Share documentation for real-time updates and stakeholder communication.

Benefits:

  • Improves workflow efficiency.
  • Enables collaboration and knowledge sharing.
  • Enhances data reliability and analysis accuracy.

Next: Learn about easy ways to share documentation and impress stakeholders.

Tutorial: Data Cleaning Documentation

Why Document Your Cleaning Process?

  • Reproducibility: Ensure you and others can replicate the steps taken to clean the data, even months or years later. This is crucial for validating results, addressing errors, and updating analyses.
  • Transparency: Facilitate collaboration and understanding among team members, allowing others to track changes and make informed decisions about the data.
  • Quality Assessment: Provide a record of cleaning decisions and rationale, enabling you to assess the quality of the data and its suitability for analysis.
  • Error Correction: Aid in identifying and fixing errors that may have been introduced during cleaning, preventing inaccurate analyses.

Key Documentation Methods:

  1. Changelogs:
    • Create a chronological record of modifications made to the data.
    • Include details such as date, time, changes made, and who made them.
    • Tools:
      • Spreadsheets: Utilize version history and edit history features.
      • SQL: Use query history, commit queries with explanations, or add comments within code.
      • Consider dedicated software or version control systems (e.g., Git) for complex projects.
  2. Code Comments:
    • Write clear and concise comments within your cleaning code to explain the logic and reasoning behind each step.
    • This enhances readability and understanding for future users.
  3. Readme Files:
    • Create a comprehensive overview of the cleaning process, including:
      • Data sources and format
      • Cleaning steps and rationale
      • Known issues or limitations
      • Assumptions made
      • Instructions for reproducing the cleaning process
  4. Data Dictionaries:
    • Define variables, their meaning, data types, and any transformations applied.
    • Promote consistency in interpretation and usage.
  5. Visualizations:
    • Use charts, graphs, or diagrams to illustrate the cleaning process and the effects of cleaning decisions.
    • Enhance understanding and communication of data quality issues.

Best Practices:

  • Start Early and Update Regularly: Document as you clean, not as an afterthought.
  • Be Clear and Concise: Use language that is easy to understand and avoid technical jargon.
  • Focus on Key Decisions and Rationale: Explain why certain choices were made.
  • Version Control: Track changes and revert to previous versions if needed.
  • Share and Collaborate: Make documentation accessible to team members and stakeholders.
  • Tailor to Audience: Adjust the level of detail based on the intended audience.

Remember: Data cleaning documentation is an investment that pays off in terms of reproducibility, transparency, and trust in your data analysis.

Hi again. Now that you’ve learned how to make your
data squeaky clean, it’s time to address all the
dirt you’ve left behind. When you clean your data, all the incorrect or
outdated information is gone, leaving you with the
highest-quality content. But all those changes you made to the data are valuable too. In this video, we’ll discuss why keeping track of changes is important to every data
project and how to document all your
cleaning changes to make sure everyone
stays informed. This involves documentation which is the process of
tracking changes, additions, deletions and errors involved in your data
cleaning effort. You can think of it
like a crime TV show. Crime evidence is found at the scene and passed on
to the forensics team. They analyze every inch of the scene and
document every step, so they can tell a story
with the evidence. A lot of times, the forensic scientist is called to court to testify about that evidence, and they have a detailed report to refer to. The same thing applies
to data cleaning. Data errors are the crime, data cleaning is
gathering evidence, and documentation is detailing exactly what happened for
peer review or court. Having a record of how
a data set evolved does three very
important things. First, it lets us recover
data-cleaning errors. Instead of scratching our heads, trying to remember what we might have done three months ago, we have a cheat sheet
to rely on if we come across the same
errors again later. It’s also a good idea to create a clean table rather than overriding your
existing table. This way, you still
have the original data in case you need to
redo the cleaning. Second, documentation
gives you a way to inform other users of
changes you’ve made. If you ever go on
vacation or get promoted, the analyst who
takes over for you will have a reference
sheet to check in with. Third, documentation helps
you to determine the quality of the data to
be used in analysis. The first two benefits assume
the errors aren’t fixable. But if they are, a record gives the data engineer more
information to refer to. It’s also a great warning for
ourselves that the data set is full of errors and should
be avoided in the future. If the errors were
time-consuming to fix, it might be better to check out alternative data sets
that we can use instead. Data analysts usually use a changelog to access
this information. As a reminder, a changelog
is a file containing a chronologically ordered list of modifications made to a project. You can use and
view a changelog in spreadsheets and SQL to
achieve similar results. Let’s start with the spreadsheet. We can use Sheet’s version history, which provides a
real-time tracker of all the changes and who made them from individual cells to
the entire worksheet. To find this feature, click the File tab, and then select Version history. In the right panel, choose an earlier version. We can find who
edited the file and the changes they made in the
column next to their name. To return to the current version, go to the top left
and click “Back.” If you want to check out
changes in a specific cell, we can right-click and
select Show Edit History. Also, if you want others to be able to browse a
sheet’s version history, you’ll need to assign permission. Now let’s switch gears
and talk about SQL. The way you create and
view a changelog with SQL depends on the software
program you’re using. Some companies even have
their own separate software that keeps track of changelogs
and important SQL queries. This gets pretty advanced. Essentially, all
you have to do is specify exactly what you did and why when you commit a query to the repository as a new
and improved query. This allows the company to revert back to a previous version if something you’ve done
crashes the system, which has happened to me before. Another option is to just add comments as you go while
you’re cleaning data in SQL. This will help you construct your changelog after the fact. For now, we’ll check out query history, which tracks
all the queries you’ve run. You can click on any of
them to revert back to a previous version
of your query or to bring up an older version
to find what you’ve changed. Here’s what we’ve got. I’m in the Query history tab. Listed on the bottom right are all the queries that
run by date and time. You can click on this
icon to the right of each individual query to bring
it up to the Query editor. Changelogs like these are a great way to keep
yourself on track. It also lets your team get real-time updates
when they want them. But there’s another way
to keep the communication flowing, and that’s reporting. Stick around, and you’ll learn
some easy ways to share your documentation and maybe impress your stakeholders
in the process. See you in the next video.
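
As a sketch of the “comment as you go” idea mentioned above, the comments below double as raw material for a later changelog entry; the table and column names are hypothetical:

  -- 2024-06-03: Cleaning pass on the raw sales import.
  -- Wrote results to a new table instead of overwriting the original,
  -- so the raw data stays available if the cleaning needs to be redone.
  CREATE TABLE sales_clean AS
  SELECT
    product_name,
    TRIM(supplier) AS supplier,  -- removed stray leading/trailing spaces
    unit_price
  FROM sales_raw;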

Reading: Embrace changelogs


Practice Quiz: Self-Reflection: Creating a changelog

Video: Why documentation is important

Summary of Data Cleaning Reporting:

Data analysis, like a crime drama:

  • Data is the evidence.
  • Data cleaning is the forensic investigation.
  • Reporting is presenting findings in court.

Documentation as evidence log:

  • Tracks changes, deletions, and errors in data cleaning.
  • Serves as a reference for future analysts and stakeholders.
  • Examples: changelogs, comments in code.

Reporting the case to stakeholders:

  • Explains steps taken to clean the data.
  • Quantifies the impact of changes (e.g., removing duplicates).
  • Uses clear and concise language.

Benefits of transparency:

  • Ensures everyone is informed about data quality.
  • Builds trust and credibility with stakeholders.
  • Facilitates collaboration and knowledge sharing.

Next: Learn advanced reporting techniques using SQL comments and other methods.

Tutorial: Data Cleaning Reporting

Presenting Your Clean-Up: A Data Analyst’s Testimony

Data cleaning isn’t just about scrubbing the grime from your data; it’s about presenting your discoveries and justifying your decisions. Just like a forensic scientist testifies on the stand, data analysts become storytellers, revealing the hidden truths they’ve unearthed in the data. This tutorial equips you with the tools to build a compelling narrative around your data cleaning efforts.

Why Report?

  • Transparency: Share your cleaning decisions with stakeholders and collaborators.
  • Reproducibility: Document your steps for future reference and validation.
  • Trustworthiness: Showcase your meticulousness and build confidence in your analysis.
  • Knowledge Sharing: Educate others about the data and potential errors.

Reporting Styles:

  1. Textual Reports:
    • List cleaning steps chronologically with clear explanations.
    • Highlight major changes and their impact on the data.
    • Include tables, charts, and screenshots for visual clarity.
  2. Interactive Notebooks:
    • Embed code with explanations to show transformations in action.
    • Showcase visualized data before and after cleaning.
    • Allow readers to explore and interact with the data analysis.
  3. Presentations:
    • Condense key findings into engaging visuals and concise points.
    • Explain cleaning decisions in a story-like manner.
    • Tailor your message to the audience’s level of technical expertise.

Tips for Effective Reporting:

  • Start Early: Document your process as you clean, not as an afterthought.
  • Focus on Impact: Explain how your changes improve data quality and analysis.
  • Quantify Changes: Use metrics to show the effects of your cleaning (e.g., error reduction, duplicate removal).
  • Use Clear Language: Avoid technical jargon and explain complex concepts in simple terms.
  • Structure Your Report: Make it easy to navigate with logical sections and headings.
  • Proofread and Edit: Ensure accuracy and readability before sharing your report.

Advanced Techniques:

  • SQL Comments: Embed explanations within your cleaning code for reference.
  • Version Control: Track changes and revert to previous versions if needed.
  • Interactive Dashboards: Build dynamic visualizations that update with the data.

Remember: Your data cleaning report is your chance to shine as a data detective. Showcase your meticulous work, explain your discoveries in a compelling way, and build trust in your data-driven conclusions.

Ready to present your case? Go forth and clean, document, and share!

Great, you’re back. Let’s set the stage. The crime is dirty data. We’ve gathered the evidence. It’s been cleaned, verified,
and cleaned again. Now it’s time to
present our evidence. We’ll retrace the steps and present our case to our peers. As we discussed
earlier, data cleaning, verifying, and reporting
is a lot like crime drama. Now it’s our day in court. Just like a forensic scientist testifies on the stand
about the evidence, data analysts are
counted on to present their findings after a
data cleaning effort. Earlier, we learned
how to document and track every step of the
data cleaning process, which means we have solid
information to pull from. As a quick refresher, documentation is the process of tracking changes, additions, deletions, and errors involved
in a data cleaning effort, changelogs are good
example of this. Since it’s staged
chronologically, it provides a real-time
account of every modification. Documenting will be
a huge time saver for you as a future data analyst. It’s basically a cheatsheet you can refer to if
you’re working with the similar data set or need
to address similar errors. While your team can view
changelogs directly, stakeholders can’t and have to rely on your report
to know what you did. Lets check out how
we might document our data cleaning process using example we
worked with earlier. In that example, we found
that this association had two instances of
the same membership for $500 in its database. We decided to fix this manually by deleting
the duplicate info. There’re plenty of
ways we could go about documenting what we did. One common way is to
just create a doc listing out the steps we took
and the impact they had. For example, first on your list would be that you remove the duplicate instance, which decreased the number
of rows from 33 to 32, and lowered the
membership total by $500. If we were working with SQL, we could include a comment in the statement describing
the reason for a change without affecting the execution of the statement. That’s something a
bit more advanced, which we’ll talk about later. Regardless of how we capture
and share our changelogs, we’re setting ourselves
up for success by being 100 percent transparent
about our data cleaning. This keeps everyone on
the same page and shows project stakeholders that we are accountable for
effective processes. In other words, this helps build our credibility
as witnesses who can be trusted to present all the evidence accurately
during testimony. For dirty data, it’s
an open and shut case.

Video: Feedback and cleaning

Summary of Data Cleaning Feedback and Application:

Beyond Validation:

  • Data cleaning provides insights beyond just data accuracy.
  • Error patterns revealed during cleaning can be used to improve data collection processes.

Common Data Errors:

  • Human mistakes (typos, misspellings)
  • Flawed processes (survey design)
  • System issues (integration errors)

Data Cleaning as Feedback Loop:

  • Documentation and reporting enable error pattern identification.
  • Feedback used to adjust data collection procedures and quality control.
  • Examples: reprogramming data collection, updating survey questions, data engineer meetings.

Benefits of Error Reduction:

  • Increased trust in data for decision-making.
  • Improved data collection efficiency.
  • Potential for increased revenue.

Next Steps:

  • Continue building data validation skills.
  • Apply these lessons to future data cleaning projects.

Tutorial: Data Cleaning Feedback and Application

Beyond the Report: Transforming Data Insights into Action

Congratulations! You’ve meticulously cleaned, documented, and reported on your data. But the journey doesn’t end there. The real treasure lies in leveraging the feedback from your cleaning process to optimize future data collection and analysis. This tutorial will guide you through how to transform data cleaning insights into actionable improvements.

From Glitches to Growth:

Data cleaning isn’t just about removing dirt; it’s about uncovering gems. The errors and inconsistencies you encounter reveal valuable information about your data collection and processing systems. By analyzing these “glitches,” you can identify opportunities for growth and improvement.

Identifying Error Patterns:

  • Human Errors: Look for patterns in typos, misspellings, and inconsistent formatting. This might indicate poorly designed data entry interfaces or training needs for data collectors.
  • Process Errors: Analyze error frequency in specific fields or survey questions. This could expose flaws in survey design, data validation rules, or data transfer procedures.
  • System Errors: Track inconsistencies arising from data integration or software issues. This often requires collaboration with data engineers to diagnose and fix underlying system problems.
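
One way to quantify the patterns listed above, assuming you keep a simple log of every correction made during cleaning (the cleaning_log table and its columns are hypothetical):

  -- Tally corrections by field and error type to see where problems cluster;
  -- the largest groups point at the collection step most worth fixing.
  SELECT field_name, error_type, COUNT(*) AS occurrences
  FROM cleaning_log
  GROUP BY field_name, error_type
  ORDER BY occurrences DESC;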

Turning Feedback into Action:

  • Data Collection Optimization: Use your findings to redesign data entry forms, improve user interfaces, or implement stricter validation rules.
  • Survey Reform: Identify poorly worded questions, ambiguous choices, or missing answer options. Revise your survey instrument based on these insights.
  • Quality Control Enhancement: Schedule regular data audits, implement automated error detection algorithms, or establish data cleaning protocols for specific data sources.
  • System Upgrades: Collaborate with data engineers to address integration issues, fix buggy software, or implement data cleansing algorithms at the source.

Communication is Key:

Share your feedback and proposed improvements with stakeholders, data owners, and data engineering teams. Clear communication and collaboration are crucial for implementing effective changes.

The Ripple Effect of Better Data:

By refining your data collection and processing systems, you’ll reap the rewards of cleaner, more reliable data. This, in turn, leads to:

  • Accurate Analyses: More trustworthy data fuels confident decision-making and impactful business strategies.
  • Efficient Operations: Reduced errors and improved data quality streamline workflows and save time and resources.
  • Increased Revenue: Insightful data analysis can uncover new opportunities for growth, market expansion, and cost optimization.

Remember: Data cleaning is not just a chore; it’s a powerful tool for continuous improvement. Embrace the feedback it offers, refine your data collection systems, and empower your organization to make data-driven decisions that lead to success.

Start applying these techniques to your next data cleaning project and watch your data transform from messy to marvelous!

Welcome back. By now it’s safe to say that verifying,
documenting and reporting are valuable steps
in the data-cleaning process. You have proof to give stakeholders
that your data is accurate and reliable. And the effort to attain it was
well-executed and documented. The next step is getting feedback about
the evidence and using it for good, which we’ll cover in this video. Clean data is important
to the task at hand. But the data-cleaning process itself
can reveal insights that are helpful to a business. The feedback we get when we report on our
cleaning can transform data collection processes, and
ultimately business development. For example, one of the biggest challenges of working
with data is dealing with errors. Some of the most common errors involve
human mistakes like mistyping or misspelling, flawed processes like poor
design of a survey form, and system issues where older
systems integrate data incorrectly. Whatever the reason, data-cleaning
can shine a light on the nature and severity of error-generating processes. With consistent documentation and
reporting, we can uncover error patterns in data
collection and entry procedures and use the feedback we get to make
sure common errors aren’t repeated. Maybe we need to reprogram
the way the data is collected or change specific questions
on the survey form. In more extreme cases, the feedback we get can even send
us back to the drawing board to rethink expectations and possibly
update quality control procedures. For example, sometimes it’s useful to
schedule a meeting with a data engineer or data owner to make sure
the data is brought in properly and doesn’t require constant cleaning. Once errors have been identified and
addressed, stakeholders have data they can trust for
decision-making. And by reducing errors and
inefficiencies in data collection, the company just might discover
big increases to its bottom line. Congratulations! You now have the foundation you need
to successfully verify and report on your cleaning results. Stay tuned to keep building
on your new skills.

Reading: Advanced functions for speedy data cleaning


Practice Quiz: Test your knowledge on documenting the cleaning process

Why is it important for a data analyst to document the evolution of a dataset? Select all that apply.

Fill in the blank: While cleaning data, documentation is used to track _____. Select all that apply.

Documenting data-cleaning makes it possible to achieve what goals? Select all that apply.

Module 4 challenge


Reading: Glossary: Terms and definitions

Quiz: Module 4 challenge

What is involved in seeing the big picture when verifying data cleaning? Select all that apply.

Fill in the blank: TRIM is a function that removes _____ spaces in data. Select all that apply.

A data analyst uses the COUNTA function to count which of the following?

Fill in the blank: A data analyst uses the CASE statement to consider one or more _____, then returns a value.

Fill in the blank: Documentation is the process of tracking _____ during data cleaning. Select all that apply.

At what point during the analysis process does a data analyst use a changelog?

A data analyst commits a query to the repository as a new and improved query. Then, they specify the changes they made and why they made them. This scenario is part of what process?