Module 3: Cleaning data with SQL

Knowing a variety of ways to clean data can make an analyst’s job much easier. In this part of the course, you’ll check out how to clean your data using SQL. You’ll explore queries and functions that you can use in SQL to clean and transform your data to get it ready for analysis.

Learning Objectives

Describe how SQL can be used to clean large datasets
Compare spreadsheet data-cleaning functions to those associated with SQL in databases
Develop basic SQL queries for use with databases
Apply basic SQL functions for use in cleaning string variables in a database
Apply basic SQL functions for transforming data variables

Table Of Contents

Using SQL to clean data
Learn basic SQL queries
Transforming data
Module 3 challenge

Using SQL to clean data

Video: Using SQL to clean data

Notes

Transcript

Data cleaning is crucial for data analysis, preparing it for the “actual analysis” stage.
This session focuses on data cleaning using SQL, exploring its tools and effectiveness for large datasets.
Upcoming topics include:
- Comparing data cleaning functions in spreadsheets and SQL.
- Developing basic search queries for databases.
- Applying basic SQL functions for data transformation and string cleaning.
A brief introduction to SQL and its suitability for data cleaning tasks is also planned.

Key Takeaways:

Master SQL, a powerful tool for cleaning large datasets.
Learn essential SQL functions for transforming and cleaning data.
Develop basic search queries to efficiently navigate databases.

Next Steps:

Dive deeper into the world of data cleaning with SQL!

Welcome back and
great job on that last weekly challenge. Now that we know the difference
between cleaning dirty data and some general data cleaning techniques, let’s focus on data cleaning using SQL. Coming up we’ll learn about the different
data cleaning functions in spreadsheets and SQL and how SQL can be used
to clean large data sets. I’ll also show you how to develop some
basic search queries for databases and how to apply basic SQL functions for
transforming data and cleaning strings. Cleaning your data is the last
step in the data analysis process before you can move on to
the actual analysis, and SQL has a lot of great tools
that can help you do that. But before we start cleaning databases,
we’ll take a closer look at SQL and when to use it. I’ll see you there.

Video: Sally: For the love of SQL

Notes

Video

Transcript

Topic: The Benefits of Being a Data Analyst with SQL Skills

Speaker: Sally, a Measurement and Analytical Lead at Google

Key Points:

Data analysts help advertising agencies use Google platforms and analyze results for their clients.
Data analysis skills are in high demand across various industries like healthcare, e-commerce, and entertainment.
SQL enables quick and efficient analysis of large datasets, a significant improvement over past limitations.
Knowing SQL is a core skill for data analysts and highly sought after by employers, making it career-boosting.
Sally’s personal experience highlights the satisfaction and power of mastering SQL through self-learning.
The growing volume of data and computing power further increases the demand for skilled data analysts, promising a bright career outlook.

Overall Message:

Being a data analyst with strong SQL skills offers exciting career opportunities and intellectual satisfaction in a data-driven world.

Advertising agencies get
money from their clients to advertise their brand. These agencies use our products, use certain Google platforms,
advertising platforms, and I help them with how to
best use those platforms, different strategies they
can use to be best in class. A lot of the folks at
the advertising agencies have reports that they have
to send out to their clients. These reports take a lot of
time to create and visualize, and so what I do is I
help the practitioners and the analytics teams use a particular product that
enables them to create those reports much
faster and much easier. If you’re going to start
off as a data analyst, it opens tons of doors because everybody
is tracking data, is using data, needs to use
data, regardless of industry. Anywhere from health care, to advertising, to e-commerce, to entertainment, anything and everything,
everybody uses data, so everybody needs you
as a data analyst. SQL makes our lives easier when we’re analyzing
lots of different data. It’s only somewhat recently that the SQL programs
that we use now can give us instant results for analyzing millions
or billions of data. Years ago, maybe about
five years ago or so, even though we could still analyze those millions of rows, we would end up having
to wait fifteen minutes, thirty minutes for the
queries to run. But now it’s instantaneous, and so that’s really exciting, and we can do so much
more with that power. SQL has helped a lot in my
career because it’s one of those fundamental
things you have to know as a data analyst. Back in the day, not
everyone knew SQL, so knowing SQL was definitely
a competitive advantage. Nowadays, I would
say more people, maybe most people know it. It’s a core skill and highly sought after
by everybody. So, knowing SQL, becoming
a data analyst makes you quite popular
from recruiters, so I think that’s really fun. I taught myself SQL, so my knowledge about
SQL is something I hold near and dear,
close to my heart since it’s something that
almost I’ve made for myself, and I feel so much
satisfaction from it. So that’s why I really like SQL. One of the fun
things about SQL and another reason why I
really enjoy using it is because when you type
something in that query, and you just hit Control, Shift, Enter, or once you’ve run the query, you get the results almost instantly, depending on
the platform you use. But it’s fascinating
to see if you think conceptually how much analysis the computer is doing
for you based on that little bit of command code or a little
bit of code you wrote, and it’s just so powerful if you think about what’s
happening behind the scenes. So I think that’s fun to look at. We live in a world of big data, and it keeps getting bigger. The computing power is also
increasing exponentially. With all the data
that we can track, the more and more we can track that data, the more and more we need data analysts. Our career prospects are
basically skyrocketing. I’m Sally, I’m a measurement and
analytical lead at Google.

Video: Understanding SQL capabilities

Notes

Video

Tutorial

Transcript

Summary: What is SQL and Why Do Data Analysts Use It?

Key Points:

SQL Definition: A structured query language for working with databases.
Ideal for large datasets: Can process trillions of rows, unlike spreadsheets.
Efficient and fast: Processes massive data in seconds.
Standardized: Common language for relational database communication.
Personal experience: The speaker learned SQL for a job requirement and found it rewarding.
Learning resources: Many online resources are available for self-learning.
Next steps: Exploring practical applications of SQL and its connection to spreadsheet skills.

Overall Message:

SQL is a powerful tool for data analysts, especially when dealing with large datasets. Its efficiency, standardization, and ease of learning make it an essential skill for anyone in the data field.

What is SQL and Why Do Data Analysts Use It? – A Tutorial

Get ready to unlock the world of structured query language (SQL)! This tutorial will demystify what SQL is, why data analysts rely on it, and how it simplifies working with massive datasets.

What is SQL?

Imagine a language specifically designed to talk to databases, tell them what data you need, and organize it in meaningful ways. That’s SQL! It’s a powerful tool that lets data analysts:

Read and retrieve data: Pull specific information from tables within a database.
Update and manipulate data: Edit, add, or delete entries to keep your data accurate.
Analyze and summarize data: Calculate statistics, group information, and discover insights.

Why Do Data Analysts Love SQL?

Unlike spreadsheets that struggle with large datasets, SQL excels at managing massive amounts of data with ease. Here’s why it’s a data analyst’s best friend:

Speed: SQL can process millions or even billions of rows in seconds, while spreadsheets might take hours or even days for the same task.
Efficiency: It handles complex queries with elegant syntax, allowing analysts to retrieve and manipulate data with precision and minimal code.
Scalability: As your data grows, SQL gracefully adapts, effortlessly handling even the biggest datasets.
Standardization: SQL is the universal language of relational databases, making it readily transferable between different platforms and companies.
Versatility: It’s not just for reading data! SQL lets you analyze, transform, and even combine information from multiple tables, unlocking powerful insights.

Real-world examples:

Marketing analyst: Uses SQL to track website traffic, analyze campaign performance, and identify target audiences.
Financial analyst: Queries databases to assess financial trends, forecast market behavior, and evaluate investment options.
Healthcare analyst: Leverages SQL to analyze patient data, identify disease patterns, and improve healthcare delivery.

Learning SQL:

The good news is that anyone can learn SQL! Plenty of online resources and interactive courses can guide you through the basics and advanced concepts. With dedication and practice, you’ll be querying databases like a pro in no time.

Ready to take the plunge?

This tutorial is just the beginning! Get ready to explore the fascinating world of SQL, unlock its power for data analysis, and unleash your inner data wizard!

Remember:

SQL is a valuable skill for data analysts across various industries.
Its efficiency, scalability, and versatility make it essential for handling large datasets.
Learning SQL opens doors to exciting career opportunities and empowers you to extract valuable insights from data.

So, start your SQL journey today and discover the magic of querying your way to data-driven success!

Hello, again. So before we go over all the ways
data analysts use SQL to clean data, I want to formally introduce you to SQL. We’ve talked about SQL a lot already. You’ve seen some databases and
some basic functions in SQL, and you’ve even seen how SQL
can be used to process data. But now let’s actually define SQL. SQL is a structured query language
that analysts use to work with databases. Data analysts usually use SQL
to deal with large datasets because it can handle
huge amounts of data. And I mean trillions of rows. That’s a lot of rows to
wrap your head around. So let me give you an idea about
how much data that really is. Imagine a data set that contains the names
of all 8 billion people in the world. It would take the average person 101
years to read all 8 billion names. SQL can process this in seconds. Personally, I think that’s pretty cool. Other tools like spreadsheets might take
a really long time to process that much data, which is one of the main reasons
data analysts choose to use SQL, when dealing with big datasets. Let me give you a short history on SQL. Development on SQL actually
began in the early 70s. In 1970, Edgar F.Codd developed
the theory about relational databases. You might remember learning about
relational databases a while back. This is a database that contains a series
of tables that can be connected to form relationships. At the time IBM was using a relational
database management system called System R. Well, IBM computer scientists were trying
to figure out a way to manipulate and retrieve data from IBM System R. Their first query
language was hard to use. So they quickly moved on
to the next version, SQL. In 1979, after extensive testing SQL, now
just spelled S-Q-L, was released publicly. By 1986,
SQL had become the standard language for relational database communication,
and it still is. This is another reason why
data analysts choose SQL. It’s a well-known standard
within the community. The first time I used SQL to pull
data from a real database was for my first job as a data analyst. I didn’t have any background
knowledge about SQL before that. I only found out about it because
it was a requirement for that job. The recruiter for
that position gave me a week to learn it. So I went online and researched it and
ended up teaching myself SQL. They actually gave me a written test
as part of the job application process. I had to write SQL queries and
functions on a whiteboard. But I’ve been using SQL ever since. And I really like it. And just like I learned SQL on my own, I wanted to remind you that you can
figure things out yourself too. There’s tons of great online resources for
learning. So don’t let one job requirement
stand in your way without doing some research first. Now that we know a little more about why
analysts choose to work with SQL when they’re handling a lot of data and
a little bit about the history of SQL, we’ll move on and
learn some practical applications for it. Coming up next, we’ll check out some of
the tools we learned in spreadsheets and figure out if any of those
apply to working in SQL. Spoiler alert, they do. See you soon.

Reading: Using SQL as a junior data analyst

Reading

Using-SQL-as-a-junior-data-analyst Download

Video: Spreadsheets versus SQL

Notes

Video

Tutorial

Quiz

Transcript

Summary of Spreadsheet vs. SQL Similarities and Differences:

Similarities:

Both involve tools for data cleaning, manipulation, and analysis.
Both allow performing arithmetic calculations and joining data sources.
Both can be used to find specific information within data sets.

Differences:

Purpose: Spreadsheets for smaller, personal data sets; SQL for larger, shared data sets in databases.
Ease of Use: Spreadsheets have built-in functions and user interfaces; SQL requires coding knowledge.
Data Access: Spreadsheets access data manually; SQL pulls data automatically from various sources.
Collaboration: Spreadsheets for individual work; SQL for teamwork and audit trails.
Scalability: Spreadsheets limited to millions of rows; SQL handles trillions of rows across different database programs.

Key Takeaways:

Spreadsheets and SQL complement each other.
Choose the right tool based on data size, collaboration needs, and desired functionality.
Next steps: Learn more SQL queries and functions, leverage spreadsheet tools in new ways.

Welcome to this tutorial on Spreadsheets and SQL: Understanding Their Similarities and Differences!

In this tutorial, we’ll explore:

Commonalities between spreadsheets and SQL
Key distinctions in their purpose, usage, and capabilities
Guidance on choosing the appropriate tool for your data tasks

Let’s dive in!

I. Common Ground: Tools for Data Exploration

Data Cleaning: Both spreadsheets and SQL offer features for identifying and correcting errors, ensuring data integrity.
Calculations: Perform arithmetic operations, create formulas, and manipulate data numerically in either tool.
Data Joining: Combine information from multiple sources, such as merging tables or linking datasets.
Information Retrieval: Locate specific data points using functions like COUNTIF in spreadsheets or COUNT and WHERE in SQL.

II. Key Differences: Delving into Distinctions

Purpose and Scale:
- Spreadsheets excel at managing smaller, personal datasets, while SQL shines with large, shared datasets within databases.
Ease of Use:
- Spreadsheets provide a visual interface and built-in functions, making them user-friendly. SQL requires learning a query language, demanding some coding knowledge.
Data Access:
- Spreadsheets involve manual data entry, limiting their scope. SQL empowers you to pull data automatically from diverse database sources.
Collaboration and Tracking:
- Spreadsheets are often used for individual work, while SQL facilitates teamwork with shared databases and query histories.
Scalability:
- Spreadsheets can handle millions of rows, but SQL can manage massive datasets, even trillions of rows.

III. Choosing the Right Tool for the Job

Data Size: For smaller datasets and quick analysis, spreadsheets often suffice. For large-scale data management and complex queries, SQL is the preferred choice.
Collaboration: SQL excels in multi-user environments, ensuring data consistency and tracking changes.
Functionality: Spreadsheets offer user-friendly features like formatting and spellcheck, while SQL provides advanced capabilities like data aggregation and complex filtering.

IV. Next Steps: Expanding Your Data Toolkit

Deepen Your SQL Skills: Learn more queries and functions to unlock the full power of database interaction.
Reimagine Spreadsheet Use: Discover new ways to leverage spreadsheets alongside SQL for a comprehensive data analysis approach.

Remember, spreadsheets and SQL are not mutually exclusive. They complement each other, each offering unique strengths for different data challenges. By understanding their similarities and differences, you’ll make informed decisions to achieve your data goals effectively!

A team of analysts is working on a data analytics project. How could data in a SQL database be more useful to the team than data in spreadsheets? Select all that apply.

They can track changes to SQL queries across the team.
They can use SQL to interact with the database program.
They can use SQL to pull information from the database at the same time.

Data stored in a SQL database is useful to a project with multiple team members because they can access the data at the same time, use SQL to interact with the database program, and track changes to SQL queries across the team.

Hey there. So far we’ve learned about both
spreadsheets and SQL. While there’s lots of differences between
spreadsheets and SQL, you’ll find some
similarities too. Let’s check out what
spreadsheets and SQL have in common and how
they’re different. Spreadsheets and SQL actually
have a lot in common. Specifically, there’s
tools you can use in both spreadsheets and SQL
to achieve similar results. We’ve already learned about some tools for cleaning data in spreadsheets, which
means you already know some tools that
you can use in SQL. For example, you can
still perform arithmetic, use formulas and join data
when you’re using SQL, so we’ll build on the skills we’ve
learned in spreadsheets and use them to do even
more complex work in SQL. Here’s an example of what I
mean by more complex work. If we were working
with health data for a hospital, we’d need to be able to access and process a lot of data. We might need
demographic data, like patients’ names,
birthdays, and addresses, information about their
insurance or past visits, public health data or even user generated data to add to
their patient records. All of this data
is being stored in different places, maybe even in different formats, and
each location might have millions of rows and
hundreds of related tables. This is way too
much data to input manually, even for
just one hospital. That’s where SQL comes in handy. Instead of having to look at each individual data
source and record it in our spreadsheet, we
can use SQL to pull all this information from different locations
in our database. Now, let’s say we want to find
something specific in all this data, like how many patients with a certain
diagnosis came in today. In a spreadsheet we can use the COUNTIF function to
find that out, or we can combine the COUNT
and WHERE queries in SQL to find out how many rows
match our search criteria. This will give us similar
results, but works with a much larger and more
complex set of data. Next, let’s talk about how spreadsheets and
SQL are different. First, it’s important
to understand that spreadsheets and SQL
are different things. Spreadsheets are generated with a program like Excel
or Google Sheets. These programs are designed to execute certain
built-in functions. SQL on the other hand is a language that can
be used to interact with database programs, like Oracle MySQL or
Microsoft SQL Server. The differences between the two are mostly in
how they’re used. If a data analyst was given data in the form of a
spreadsheet they’ll probably do their data cleaning and analysis within that
spreadsheet, but if they’re working with a large data set with more than a million rows or multiple files within
a database, it’s easier, faster and more
repeatable to use SQL. SQL can access and use a lot more data because it
can pull information from different sources in the
database automatically, unlike spreadsheets which only have access to the
data you input. This also means that data is
stored in multiple places. A data analyst might
use spreadsheets stored locally on their
hard drive or their personal cloud when
they’re working alone, but if they’re on a larger team
with multiple analysts who need to access and use
data stored across a database, SQL might
be a more useful tool. Because of these
differences, spreadsheets and SQL are used for
different things. As you already know,
spreadsheets are good for smaller data sets and when
you’re working independently. Plus, spreadsheets have
built-in functionalities, like spell check that
can be really handy. SQL is great for working
with larger data sets, even trillions of rows of data. Because SQL has been the standard language for communicating with
databases for so long, it can be adapted and used for multiple
database programs. SQL also records changes
in queries, which makes it easy to track changes across your team if you’re
working collaboratively. Next, we’ll learn more
queries and functions in SQL that will give you some
new tools to work with. You might even learn how
to use spreadsheet tools in brand new ways.
See you next time.

Reading: SQL dialects and their uses

Reading

In this reading, you will learn more about SQL dialects and some of their different uses. As a quick refresher, Structured Query Language, or SQL, is a language used to talk to databases. Learning SQL can be a lot like learning a new language — including the fact that languages usually have different dialects within them. Some database products have their own variant of SQL, and these different varieties of SQL dialects are what help you communicate with each database product.

These dialects will be different from company to company and might change over time if the company moves to another database system. So, a lot of analysts start with Standard SQL and then adjust the dialect they use based on what database they are working with. Standard SQL works with a majority of databases and requires a small number of syntax changes to adapt to other dialects.

As a junior data analyst, it is important to know that there are slight differences between dialects. But by mastering Standard SQL, which is the dialect you will be working with in this program, you will be prepared to use SQL in any database.

More information

You may not need to know every SQL dialect, but it is useful to know that these different dialects exist. If you are interested in learning more about SQL dialects and when they are used, you can check out these resources for more information:

LearnSQL’s blog, What Is a SQL Dialect, and Which One Should You Learn?
Software Testing Help’s article, Differences Between SQL Vs MySQL vs SQL Server
Datacamp’s blog, SQL Server, PostgreSQL, MySQL… what’s the difference? Where do I start? Note that there is an error in this blog article. The comparison table incorrectly states that SQlite uses subqueries instead of window functions. Refer to the SQLite Window Functions documentation for proper clarification.
SQL Tutorial’s tutorial, What is SQL

Practice Quiz: Hands-On Activity: Processing time with SQL

Reading

Hands-On-Activity_-Processing-time-with-SQL Download

Practice Quiz: Test your knowledge on SQL

Which of the following are benefits of using SQL? Select all that apply.

SQL can handle huge amounts of data.
SQL can be adapted and used with multiple database programs.
SQL offers powerful tools for cleaning data.

SQL can handle huge amounts of data, can be adapted and used with multiple database programs, and offers powerful tools for cleaning data.

Which of the following tasks can data analysts do using both spreadsheets and SQL? Select all that apply.

Perform arithmetic, Use formulas, Join data

Analysts can use SQL and spreadsheets to perform arithmetic, use formulas, and join data.

SQL is a language used to communicate with databases. Like most languages, SQL has dialects. What are the advantages of learning and using standard SQL? Select all that apply.

Standard SQL requires a small number of syntax changes to adapt to other dialects.
Standard SQL works with a majority of databases.

Standard SQL works with a majority of databases and requires a small number of syntax changes to adapt to other dialects.

Learn basic SQL queries

Reading: Optional: Upload the customer dataset to BigQuery

Reading

Optional_-Upload-the-customer-dataset-to-BigQuery Download

Video: Widely used SQL queries

Tutorial

Video

Here’s a tutorial on widely used SQL queries:

Introduction to SQL Queries:

SQL (Structured Query Language) is the standard language for interacting with relational databases.
Queries are instructions you send to the database to retrieve, manipulate, or modify data.
Mastering these queries is essential for data analysis, reporting, and application development.

Common SQL Queries:

SELECT:
- Retrieves specific data from a database table.
- Basic syntax: SELECT column1, column2 FROM table_name;
- Example: SELECT name, email FROM customers;
WHERE:
- Filters results based on conditions.
- Syntax: SELECT columns FROM table_name WHERE condition;
- Example: SELECT product_name, price FROM products WHERE price > 50;
ORDER BY:
- Sorts results in ascending or descending order.
- Syntax: SELECT columns FROM table_name ORDER BY column ASC/DESC;
- Example: SELECT employee_name, salary FROM employees ORDER BY salary DESC;
INSERT INTO:
- Adds new rows (records) to a table.
- Syntax: INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);
- Example: INSERT INTO orders (order_id, customer_id, order_date) VALUES (1234, 5, '2023-01-10');
UPDATE:
- Modifies existing data in a table.
- Syntax: UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition;
- Example: UPDATE products SET price = 75 WHERE product_id = 10;
DELETE:
- Removes rows from a table.
- Syntax: DELETE FROM table_name WHERE condition;
- Example: DELETE FROM customers WHERE customer_id = 25;

Additional Queries:

LIKE: Searches for patterns in text fields.
JOIN: Combines data from multiple tables.
GROUP BY: Groups data together and applies aggregate functions (SUM, COUNT, AVG, etc.).
HAVING: Filters grouped results based on conditions.

Remember:

Practice these queries using a SQL database or online exercises.
Explore advanced SQL concepts and techniques for complex data manipulation.
SQL is a powerful tool for data management and analysis, so proficiency in these queries is invaluable.

Video: Evan: Having fun with SQL

Notes

Video

Transcript

Summary: Why SQL is an Amazing First Language for Data Enthusiasts

Speaker: Evan, a learning portfolio manager at Google, passionate about data and making it accessible.

Main Points:

Evan transitioned from accounting to data analysis due to his love for numbers and desire to automate tasks.
SQL is an ideal first language for data beginners due to its ease of learning and powerful capabilities.
Despite not being a computer science expert, Evan found SQL accessible and fun to master.
Key benefits of SQL:
- Quick and efficient: Retrieve billions of data points in seconds.
- Interactive: Query and filter data easily to ask complex questions.
- Easy to learn: Simple syntax and clear structure.
- Variety: Multiple ways to write the same query, encouraging exploration.

Key Challenge:

Formulating insightful questions for your data, not just mastering the syntax.

Call to Action:

Be curious about your data and experiment with different SQL queries.
Share your queries and results to collaborate and learn from others.

Overall Message:

SQL is a powerful tool for unlocking the potential of data and making informed decisions. Don’t be intimidated by its technical aspect; dive in, explore, and have fun!

Hi, I’m Evan. I’m a learning portfolio
manager here at Google. I don’t think I’m
a computer science or super engineering type, but I really, really like working
with numbers, so actually, I went into accounting. And about after two years
of accounting I said, “Wow, I really don’t want
to do all this by hand,” so I took my first
information systems class, where they taught me the
language SQL or S-Q-L, and it completely
opened up my mind. Between a working knowledge
of spreadsheets where you change one cell and the whole spreadsheet
changes because those amazing
calculated fields and SQL where I can query billions of rows of data
in a matter of second, I was completely sold
on my love for data. I’ve dedicated my life and my career to just communicating that passion and getting
folks excited about the things that they
can do with their data. Why is SQL such an amazing
first language to pick up? Well, there’s so many things
that you can do with it. I will first caveat and say, I am not a computer
science major. I don’t know deep
down Java and Python, and I was a little
bit apprehensive of learning a
computer language. It’s like a pseudo-programming
language, but in reality, you can write your first SQL
statement as you’re going to find out here in just
five minutes or less. SQL, honestly, it’s
one of those languages that’s easy to learn and
even more fun to master. I’ve been learning
SQL for 15 years. I’ve been teaching it for 10. As you’re going to
see in some of these hands-on labs you’ll be
working through, it’s very easy to return data from within a
database or a data set. Just select whatever columns from whichever database that you’re pulling from, and immediately
you get the data back. Now, the really fun part is actually teasing
apart and saying, I wonder if I change my query, add these more columns, filter this data set
a different way, share with my colleagues. It’s meant to be an
interactive querying language, and “query” means
“asking a question.” If I can challenge
you one thing, it’s that the syntax
for picking up SQL, much like the rules
of a chess game, are very easy to pick up. But the hard part is actually not the syntax writing, much like with any
programming language, but the actual what question do you want to ask of your data? What I would encourage
you to do is be super curious about whatever
data set that you’re given. Spend a lot of time, even
before you touch your keyboard, in thinking about what data
set or what insights you can get from your data. And
then start having fun. There’s many different
ways to write the same correct SQL statement, so try one out, share it with
your friends and then start returning that data
back for insights. Good luck.

Video: Cleaning string variables using SQL

Notes

Video

Tutorial

Transcript

Summary: Cleaning Data with SQL Functions

This video delves into using SQL functions for cleaning and preparing data before analysis.

Key Points:

Removing Duplicates:
- Use DISTINCT in the SELECT statement to identify and remove duplicate rows.
- Example: SELECT DISTINCT customer_id FROM customer_data.customer_address
Cleaning String Variables:
- LENGTH: Checks the number of characters in a string.
  - Useful for verifying consistency (e.g., all country codes should be 2 letters).
  - Example: SELECT LENGTH(country) AS letters_in_country FROM customer_data.customer_address
- SUBSTRING: Extracts a specific portion of a string.
  - Used to fix inconsistencies like “USA” instead of “US” for country codes.
  - Example: SELECT customer_id FROM customer_data.customer_address WHERE SUBSTRING(country, 1, 2) = 'US'
- TRIM: Removes leading and trailing spaces from a string.
  - Useful for eliminating unexpected spaces that affect data accuracy.
  - Example: SELECT DISTINCT customer_id FROM customer_data.customer_address WHERE TRIM(state) = 'OH'

Benefits of Cleaning Data:

Ensures consistency and accuracy for accurate analysis.
Saves time and avoids errors later in the process.

Overall message:

Mastering basic SQL functions like LENGTH, SUBSTRING, and `TRIM empowers you to effectively clean and prepare your data for meaningful analysis.

Next Steps:

Explore more advanced cleaning functions and techniques in SQL.
Apply these learnings to clean and analyze your own data sets.

Here’s a tutorial on Cleaning Data with SQL Functions:

Introduction:

Emphasize the importance of data cleaning in ensuring accuracy and reliability for analysis.
Introduce SQL as a powerful tool for data cleaning and manipulation.

Key SQL Functions for Data Cleaning:

Removing Duplicates:
- Explain the purpose of removing duplicates to avoid redundancy and skewed results.
- Demonstrate the use of DISTINCT with examples: SQLSELECT DISTINCT customer_id FROM customer_data.customer_address;
Handling String Variables:
- Address common issues with string data, such as inconsistencies and formatting errors.
- Introduce the following functions with examples:
  - LENGTH: SQLSELECT LENGTH(country) AS letters_in_country FROM customer_data.customer_address;
  - SUBSTRING: SQLSELECT customer_id FROM customer_data.customer_address WHERE SUBSTRING(country, 1, 2) = 'US';
  - TRIM: SQLSELECT DISTINCT customer_id FROM customer_data.customer_address WHERE TRIM(state) = 'OH';
Additional Cleaning Functions:
- Briefly introduce other useful functions for data cleaning:
  - UPPER/LOWER: Convert text to uppercase or lowercase.
  - REPLACE: Replace specific characters or text within strings.
  - COALESCE: Handle null values by providing a default value.

Applying Functions in SQL Queries:

Provide practical examples of using these functions within SQL queries to achieve cleaning tasks.
Encourage learners to practice with a sample dataset.

Common Data Cleaning Scenarios:

Discuss common challenges faced in data cleaning and how SQL functions can address them:
- Inconsistent formatting (e.g., date formats, capitalization)
- Missing values
- Outliers
- Duplicate records

Best Practices:

Emphasize the importance of understanding your data and identifying cleaning needs before applying functions.
Promote careful testing and validation to ensure cleaning tasks are effective.
Encourage documentation of cleaning steps for reproducibility and transparency.

Conclusion:

Reiterate the benefits of data cleaning for accurate and reliable analysis.
Encourage further exploration of SQL’s cleaning capabilities and hands-on practice.
Highlight the importance of data cleaning as an essential step in any data analysis process.

It’s so great to have you back. Now that we know some basic SQL queries
and spent some time working in a database, let’s apply that knowledge to something
else we’ve been talking about: preparing and cleaning data. You already know that cleaning and completing your data before you
analyze it is an important step. So in this video, I’ll show you some
ways SQL can help you do just that, including how to remove duplicates, as well as four functions to
help you clean string variables. Earlier, we covered how to remove
duplicates in spreadsheets using the Remove duplicates tool. In SQL, we can do the same thing by including
DISTINCT in our SELECT statement. For example,
let’s say the company we work for has a special promotion for
customers in Ohio. We want to get the customer IDs
of customers who live in Ohio. But some customer information
has been entered multiple times. We can get these customer IDs by writing SELECT customer_id FROM customer_data.customer_address. This query will give us duplicates
if they exist in the table. If customer ID 9080 shows up
three times in our table, our results will have
three of that customer ID. But we don’t want that.
We want a list of all unique customer IDs. To do that, we add DISTINCT to
our SELECT statement by writing, SELECT DISTINCT customer_id FROM
customer_data.customer_address. Now, the customer ID 9080 will
show up only once in our results. You might remember we’ve talked before about
text strings as a group of characters within a cell, commonly composed
of letters, numbers, or both. These text strings need
to be cleaned sometimes. Maybe they’ve been entered differently in
different places across your database, and now they don’t match. In those cases, you’ll need to clean
them before you can analyze them. So here are some functions you can use
in SQL to handle string variables. You might recognize some of these
functions from when we talked about spreadsheets. Now it’s time to see
them work in a new way. Pull up the data set we shared
right before this video. And you can follow along step-by-step
with me during the rest of this video. The first function I want to show you is
LENGTH, which we’ve encountered before. If we already know the length our
string variables are supposed to be, we can use LENGTH to double-check that
our string variables are consistent. For some databases, this query is written
as LEN, but it does the same thing. Let’s say we’re working with
the customer_address table from our earlier example. We can make sure that all country
codes have the same length by using LENGTH on each of these strings. So to write our SQL query,
let’s first start with SELECT and FROM. We know our data comes from
the customer_address table within the customer_data data set. So we add customer_data.customer_address
after the FROM clause. Then under SELECT, we’ll write LENGTH, and then the column we want to check, country. To remind ourselves what this is, we can label this column in our
results as letters_in_country. So we add AS letters_in_country, after LENGTH(country). The result we get is a list of the number
of letters in each country listed for each of our customers. It seems like almost all of them are 2s, which means the country field
contains only two letters. But we notice one that has 3.
That’s not good. We want our data to be consistent. So let’s check out which countries
were incorrectly listed in our table. We can do that by putting
the LENGTH(country) function that we created into the WHERE clause. Because we’re telling SQL to
filter the data to show only customers whose country
contains more than two letters. So now we’ll write SELECT country FROM customer_data.customer_address WHERE LENGTH(country) greater than 2. When we run this query, we now get the two
countries where the number of letters is greater than the 2 we expect to find. The incorrectly listed countries
show up as USA instead of US. If we created this table,
then we could update our table so that this entry shows up
as US instead of USA. But in this case, we didn’t create
this table, so we shouldn’t update it. We still need to fix this problem so
we can pull a list of all the customers in the US, including the two
that have USA instead of US. The good news is that we can account for this error in our results by using
the substring function in our SQL query. To write our SQL query,
let’s start by writing the basic structure, SELECT, FROM, WHERE. We know our data is coming from
the customer_address table from the customer_data data set. So we type in
customer_data.customer_address, after FROM. Next, we tell SQL what data
we want it to give us. We want all the customers
in the US by their IDs. So we type in customer_id after SELECT. Finally, we want SQL to filter
out only American customers. So we use the substring function
after the WHERE clause. We’re going to use the substring function
to pull the first two letters of each country so that all of them are consistent
and only contain two letters. To use the substring function, we first need to tell SQL the column
where we found this error, country. Then we specify which
letter to start with. We want SQL to pull the first two letters,
so we’re starting with the first letter,
so we type in 1. Then we need to tell SQL how many letters,
including this first letter, to pull. Since we want the first two letters, we need SQL to pull two total letters,
so we type in 2. This will give us the first
two letters of each country. We want US only, so
we’ll set this function to equals US. When we run this query, we get a list
of all customer IDs of customers whose country is the US, including
the customers that had USA instead of US. Going through our results, it seems
like we have a couple duplicates where the customer ID is shown multiple times. Remember how we get rid of duplicates? We add DISTINCT before customer_id. So now when we run this query, we have our final list of customer IDs
of the customers who live in the US. Finally, let’s check out the TRIM
function, which you’ve come across before. This is really useful if you find
entries with extra spaces and need to eliminate those extra spaces for
consistency. For example, let’s check out the state
column in our customer_address table. Just like we did for the country column, we want to make sure the state column
has the consistent number of letters. So let’s use the LENGTH function again to
learn if we have any state that has more than two letters, which is what we
would expect to find in our data table. We start writing our SQL
query by typing the basic SQL structure of SELECT, FROM, WHERE. We’re working with the customer_address
table in the customer_data data set. So we type in
customer_data.customer_address after FROM. Next, we tell SQL what we want it to pull. We want it to give us any state
that has more than two letters, so we type in state, after SELECT. Finally, we want SQL to filter for
states that have more than two letters. This condition is written
in the WHERE clause. So we type in LENGTH(state), and that it must be greater
than 2 because we want the states that have
more than two letters. We want to figure out what the incorrectly
listed states look like, if we have any. When we run this query, we get one result. We have one state that has
more than two letters. But hold on, how can this state
that seems like it has two letters, O and H for Ohio,
have more than two letters? We know that there are more than
two characters because we used the LENGTH(state) > 2 statement in the
WHERE clause when filtering out results. So that means the extra characters that
SQL is counting must then be a space. There must be a space after the H. This is where we would
use the TRIM function. The TRIM function removes any spaces. So let’s write a SQL query
that accounts for this error. Let’s say we want a list of all customer
IDs of the customers who live in “OH” for Ohio. We start with the basic SQL structure:
SELECT, FROM, WHERE. We know the data comes
from the customer_address table in the customer_data data set, so we type in customer_data.customer_address
after FROM. Next, we tell SQL what data we want. We want SQL to give us the customer
IDs of customers who live in Ohio, so we type in customer_id after SELECT. Since we know we have some
duplicate customer entries, we’ll go ahead and
type in DISTINCT before customer_id to remove any duplicate customer
IDs from appearing in our results. Finally, we want SQL to
give us the customer IDs of the customers who live in Ohio. We’re asking SQL to filter the data,
so this belongs in the WHERE clause. Here’s where we’ll use the TRIM function. To use the TRIM function,
we tell SQL the column we want to remove spaces from,
which is state in our case. And we want only Ohio customers,
so we type in = ‘OH’. That’s it. We have all customer IDs
of the customers who live in Ohio, including that customer with
the extra space after the H. Making sure that your string variables
are complete and consistent will save you a lot of time later by avoiding
errors or miscalculations. That’s why we clean data
in the first place. Hopefully functions like length,
substring, and trim will give you the tools you need to start working with
string variables in your own data sets. Next up, we’ll check out some other
ways you can work with strings and more advanced cleaning functions. Then you’ll be ready to start
working in SQL on your own. See you soon.

Practice Quiz: Hands-On Activity: Clean data using SQL

Reading

Hands-On-Activity_-Clean-data-using-SQL Download

automobile_data-automobile_data Download

Practice Quiz: Test your knowledge on SQL queries

Which of the following SQL functions can data analysts use to clean string variables? Select all that apply.

TRIM, SUBSTR

Data analysts can use the SUBSTR and TRIM functions to clean string variables.

You are working with a database table that contains data about playlists for different types of digital media. The table includes columns for playlist_id and name. You want to remove duplicate entries for playlist names and sort the results by playlist ID.
You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the name column.
NOTE: The three dots (…) indicate where to add the clause.

SELECT 
    DISTINCT name
FROM
playlist
ORDER BY
playlist_id

+----------------------------+
| name                       |
+----------------------------+
| Music                      |
| Movies                     |
| TV Shows                   |
| Audiobooks                 |
| 90’s Music                 |
| Music Videos               |
| Brazilian Music            |
| Classical                  |
| Classical 101 - Deep Cuts  |
| Classical 101 - Next Steps |
| Classical 101 - The Basics |
| Grunge                     |
| Heavy Metal Classic        |
| On-The-Go 1                |
+----------------------------+

What playlist name appears in row 6 of your query result?

Music Videos

The clause DISTINCT name will remove duplicate entries from the name column. The complete query is SELECT DISTINCT name FROM playlist ORDER BY playlist_id. The DISTINCT clause removes duplicate entries from your query result. The playlist name Music Videos appears in row 6 of your query result.

You are working with a database table that contains data about music albums. The table includes columns for album_id, title, and artist_id. You want to check for album titles that are less than 4 characters long.
You write the SQL query below. Add a LENGTH function that will return any album titles that are less than 4 characters long.

SELECT 
*
FROM
album
WHERE
    LENGTH(title) < 4

| album_id | title | artist_id |
+----------+-------+-----------+
|      131 | IV    |        22 |
|      181 | Ten   |       118 |
|      182 | Vs.   |       118 |
|      236 | Pop   |       150 |
|      239 | War   |       150 |
+----------+-------+-----------+

What album ID number appears in row 3 of your query result?

182

The function LENGTH(title) < 4 will return any album names that are less than 4 characters long. The complete query is SELECT * FROM album WHERE LENGTH(title) < 4. The LENGTH function counts the number of characters a string contains. The album ID number 182 appears in row 3 of your query result.

You are working with a database table that contains customer data. The table includes columns about customer location such as city, state, and country. You want to retrieve the first 3 letters of each country name. You decide to use the SUBSTR function to retrieve the first 3 letters of each country name, and use the AS command to store the result in a new column called new_country.
You write the SQL query below. Add a statement to your SQL query that will retrieve the first 3 letters of each country name and store the result in a new column as new_country.
NOTE: The three dots (…) indicate where to add the statement.

SELECT 
customer_id,
SUBSTR(country, 1, 3) AS new_country
FROM
customer

ORDER BY
country

+-------------+-------------+
| customer_id | new_country |
+-------------+-------------+
|          56 | Arg         |
|          55 | Aus         |
|           7 | Aus         |
|           8 | Bel         |
|           1 | Bra         |
|          10 | Bra         |
|          11 | Bra         |
|          12 | Bra         |
|          13 | Bra         |
|           3 | Can         |
|          14 | Can         |
|          15 | Can         |
|          29 | Can         |
|          30 | Can         |
|          31 | Can         |
|          32 | Can         |
|          33 | Can         |
|          57 | Chi         |
|           5 | Cze         |
|           6 | Cze         |
|           9 | Den         |
|          44 | Fin         |
|          39 | Fra         |
|          40 | Fra         |
|          41 | Fra         |
+-------------+-------------+
(Output limit exceeded, 25 of 59 total rows shown)

What customer ID number appears in row 2 of your query result?

The statement SUBSTR(country, 1, 3) AS new_country will retrieve the first 3 letters of each state name and store the result in a new column as new_country. The complete query is SELECT customer_id, SUBSTR(country, 1, 3) AS new_country FROM customer ORDER BY country. The SUBSTR function extracts a substring from a string. This function instructs the database to return 3 characters of each country, starting with the first character. The customer ID number 55 appears in row 2 of your query result.

Transforming data

Reading: Optional: Upload the store transactions dataset to BigQuery

Reading

Optional_-Upload-the-store-transactions-dataset-to-BigQuery Download

Video: Advanced data-cleaning functions, part 1

Notes

Video

Tutorial

Transcript

Topic: Cleaning and Formatting Data with the CAST Function in SQL

Key Points:

Problem: Data imported from external sources may have incorrect data types, hindering analysis.
Solution: The CAST function converts data from one type to another.
Example: Lauren’s Furniture Store needs purchase prices sorted correctly but data is stored as strings.
Applying CAST:
- Convert purchase_price to FLOAT64 using CAST(purchase_price as FLOAT64).
- Update ORDER BY clause to use the casted field.
Benefits:
- Sort data accurately across different data types.
- Prepare data for analysis and avoid misleading results.
- Useful for various data types like dates and times.
Further Learning:
- Advanced functions for data cleaning and transformation.

Overall:

The CAST function is a valuable tool for data analysts to ensure data consistency and accuracy for insightful analysis.

Here’s a tutorial on Cleaning and Formatting Data with the CAST Function in SQL:

Understanding Data Types and CAST:

Data Types: SQL categorizes data into types like integers, floats, strings, dates, and more.
CAST Function: Converts a value from one data type to another.
Syntax: CAST(expression AS data_type)

When to Use CAST:

Imported Data: Ensure correct data types after importing from external sources.
Data Mismatch: Correct inconsistencies in data types for accurate analysis.
Specific Data Manipulation: Convert data for certain operations, like sorting or calculations.

Example: Sorting Purchase Prices:

Original Query (Incorrect Sorting):

SELECT purchase_price
FROM customer_data.customer_purchase
ORDER BY purchase_price DESC;

Applying CAST:

SELECT CAST(purchase_price AS FLOAT64) AS purchase_price_float
FROM customer_data.customer_purchase
ORDER BY purchase_price_float DESC;

Key Points:

Check Data Types: Use DESCRIBE table_name or SELECT * FROM table_name LIMIT 10 to verify data types.
Common Data Type Conversions:
- Text to numbers: CAST(text_column AS FLOAT64) or CAST(text_column AS INT64)
- Numbers to text: CAST(number_column AS STRING)
- Dates and times: CAST(date_string AS DATE), CAST(time_string AS TIME), CAST(datetime_string AS DATETIME)
Handle Errors: Use TRY_CAST to avoid errors if a value can’t be converted.

Additional Tips:

Permanent Change: Use ALTER TABLE to permanently change a column’s data type.
Performance Considerations: Excessive CASTing can impact query performance.
Explore Alternatives: Consider other functions like CONVERT or data type-specific conversion functions for specific needs.

Remember:

CAST is a powerful tool for ensuring data consistency and enabling accurate analysis in SQL.
Use it effectively to clean and format your data, leading to reliable insights.

Hi there and welcome back. So far we’ve gone over
some basic SQL queries and functions that can help
you clean your data. We’ve also checked out some ways you can
deal with string variables in SQL to make your job easier. Get ready to learn more functions for
dealing with strings in SQL. Trust me, these functions will be really
helpful in your work as a data analyst. In this video,
we’ll check out strings again and learn how to use the cast function
to correctly format data. When you import data that doesn’t
already exist in your SQL tables, the data types from the new dataset
might not have been imported correctly. This is where the CAST
function comes in handy. Basically, CAST can be used to convert
anything from one data type to another. Let’s check out an example. Imagine we’re working with
Lauren’s Furniture Store. The owner has been collecting
transaction data for the past year, but she just discovered that they can’t
actually organize their data because it hadn’t been formatted correctly. So we’ll help her by converting
her data to make it useful again. For example,
let’s say we want to sort all purchases by purchase_price in descending order. That means we want the most expensive
purchase to show up first in our results. To write the SQL query,
we start with the basic SQL structure. SELECT, FROM, WHERE,
we know the data is stored in the customer_purchase table
in the customer_dataset. So we write
customer_data.customer_purchase after FROM. Next, we tell SQL what data to
give us in the select clause. We want to see the purchase_price data, so we type purchase_price after SELECT. Next is the where clause. We are not filtering out any
data since we want all purchase prices shown, so
we can take out the where clause. Finally, to sort the purchase_price in descending order, we type ORDER BY purchase_price DESC at
the end of our query. Let’s run this query. We see that 89.85 shows up at the top with 799.99 below it, but we know that 799.99 is
a bigger number than 89.85. The database doesn’t recognize
that these are numbers, so it didn’t sort them that way. If we go back to the customer_purchase
table and take a look at its schema, we can see what data type the database
thinks purchase_price is. It says here the database thinks
purchase_price is a string, when in fact it is a float,
which is a number that contains a decimal. That is why 89.85 shows up before 799.99. When we sort letters, we start from the first letter before
moving on to the second letter. So if we want to sort the words apple and
orange in descending order, we start with the first letters a and o. Since o comes after a,
orange will show up first, then apple. The database did the same with 89.85 and
799.99. It started with the first letter, which
in this case was 8 and 7 respectively. Since 8 is bigger than 7,
the database sorted 89.85 first and then 799.99 because the database
treated these as text strings. The database doesn’t recognize these
strings as floats because they haven’t been typecast to match that data type yet. Typecasting means converting
data from one type to another, which is what we’ll do
with the CAST function. We use the CAST function to
replace purchase_price with a new purchase_price that the database
recognizes as float instead of string. We start by replacing
purchase_price with CAST. Then we tell SQL the field we want to
change, which is the purchase_price field. Next is a data type we want
to change purchase_price to, which is the FLOAT data type. BigQuery stores numbers
in a 64 bit system, so the FLOAT data type is referenced
as float 64 in our query. This might be slightly different in other
SQL platforms, but basically the 64 and float 64 just indicates that we’re casting
numbers in the 64 bit system as FLOATs. We also need to sort this new field so
we change purchase_price after ORDER BY to CAST
purchase_price as FLOAT64. This is how we use the cast
function to allow SQL to recognize the purchase_price column as
FLOATs instead of text strings. Now we can sort our
purchases by purchase_price. And just like that, Lauren’s Furniture Store has data that
can actually be used for analysis. As a data analyst, you’ll be asked
to locate and organize data a lot, which is why you want to make sure you
convert between data types early on. Businesses like our Furniture Store
are interested in timely sales data, and you need to be able to account for
that in your analysis. The CAST function can be used to change
strings into other data types too, like date and time. As a data analyst, you might find
yourself using data from various sources. Part of your job is making sure the data
from those sources is recognizable and usable in your database so that you won’t
run into any issues with your analysis. And now you know how to do that. The CAST function is one great tool
you can use when you’re cleaning data. And coming up, we’ll cover some other advanced functions
that you can add to your toolbox. See you soon.

Video: Advanced data-cleaning functions, part 2

Notes

Video

Transcript

In this video, we learn about the CAST, CONCAT, and COALESCE functions in SQL and how they can be used to clean and manipulate data. Here is a summary of the key points covered:

The CAST function allows us to change the data type of a field in SQL. It can be used to convert datetime fields into date fields for cleaner results.
The CONCAT function is used to combine strings together to create new text strings. It can be helpful when we need to create unique keys or separate data based on certain criteria.
The COALESCE function is used to return non-null values in a list. It can be used to replace missing values with alternative values, making the data easier to read and analyze.
These functions are powerful tools for cleaning and manipulating data in SQL, and they can help us prepare our data for further analysis.
Practice and repetition are key to mastering these concepts, so feel free to rewatch the video and try out the commands on your own.

Remember, these functions are just a few examples of what SQL can do to clean and transform data. As you continue working with SQL, you’ll discover more advanced functions and techniques to enhance your data cleaning process.

Hey there. Great
to see you again. So far, we’ve seen some
SQL functions in action. In this video, we’ll go
over more uses for CAST, and then learn about CONCAT and COALESCE. Let’s get started. Earlier we talked about
the CAST function, which let us typecast text
strings into floats. I called out that the
CAST function can be used to change into
other data types too. Let’s check out
another example of how you can use CAST in
your own data work. We’ve got the transaction
data we were working with from our Lauren’s
Furniture Store example. But now, we’ll check out
the purchase date field. The furniture store owner
has asked us to look at purchases that occurred during their sales promotion
period in December. Let’s write a SQL query
that will pull date and purchase_price for all purchases that occurred between
December 1st, 2020, and December 31st, 2020. We start by writing the
basic SQL structure: SELECT, FROM, and WHERE. We know the data comes from the customer_purchase table
in the customer_data dataset, so we write customer_data.customer_purchase
after FROM. Next, we tell SQL
what data to pull. Since we want date
and purchase_price, we add them into the
SELECT statement. Finally, we want
SQL to filter for purchases that occurred
in December only. We type date BETWEEN
‘2020-12-01’ AND ‘2020-12-31’ in the WHERE clause. Let’s run the query. Four purchases
occurred in December, but the date field looks odd. That’s because the
database recognizes this date field as datetime, which consists of
the date and time. Our SQL query still
works correctly, even if the date field is
datetime instead of date. But we can tell SQL to convert the date field into the date data type so we see just
the day and not the time. To do that, we use the
CAST() function again. We’ll use the CAST() function to
replace the date field in our SELECT statement with the new date field that will show the date and not the time. We can do that by typing
CAST() and adding the date as the field
we want to change. Then we tell SQL the data
type we want instead, which is the date data type. There. Now we can have cleaner results
for purchases that occurred during the
December sales period. CAST is a super useful function for cleaning and sorting data, which is why I wanted you to see it in action one more time. Next up, let’s check out
the CONCAT function. CONCAT lets you add
strings together to create new text strings that can
be used as unique keys. Going back to our
customer_purchase table, we see that the
furniture store sells different colors of
the same product. The owner wants to know
if customers prefer certain colors, so the owner can manage store inventory
accordingly. The problem is, the product_code is the same, regardless
of the product color. We need to find another way
to separate products by color, so we can tell if
customers prefer one color over the others. We’ll use CONCAT to produce
a unique key that’ll help us tell the
products apart by color and count them more easily. Let’s write our SQL
query by starting with the basic structure:
SELECT, FROM, and WHERE. We know our data comes from the customer_purchase table and the customer_data dataset. We type “customer_data.customer_purchase”
after FROM Next, we tell SQL
what data to pull. We use the CONCAT()
function here to get that unique key of
product and color. So we type CONCAT(), the first column we want, product_code, and the other column we want, product_color. Finally, let’s say we
want to look at couches, so we filter for
couches by typing product = ‘couch’
in the WHERE clause. Now we can count how many times each couch was purchased and figure out if customers preferred one color over the others. With CONCAT, the furniture
store can find out which color couches are the
most popular and order more. I’ve got one last
advanced function to show you, COALESCE. COALESCE can be used to return
non-null values in a list. Null values are missing values. If you have a field that’s
optional in your table, it’ll have null in
that field for rows that don’t have appropriate
values to put there. Let’s open the
customer_purchase table so I can show you what I mean. In the customer_purchase table, we can see a couple rows where product information is missing. That is why we see nulls there. But for the rows where
product name is null, we see that there is
product_code data that we can use instead. We’d prefer SQL to show
us the product name, like bed or couch, because it’s easier
for us to read. But if the product
name doesn’t exist, we can tell SQL to give us
the product_code instead. That is where the COALESCE
function comes into play. Let’s say we wanted a list of all products that were sold. We want to use the
product_name column to understand what kind
of product was sold. We write our SQL query with the basic SQL structure:
Select, From, AND Where. We know our data comes from customer_purchase table and
the customer_data dataset. We type “customer_data.customer_purchase”
after FROM. Next, we tell SQL
the data we want. We want a list of product names, but if names aren’t available, then give us the product code. Here is where we type “COALESCE.” then we tell SQL which column
to check first, product, and which column to check second if the first column is
null, product_code. We’ll name this new
field as product_info. Finally, we are not
filtering out any data, so we can take out
the WHERE clause. This gives us product
information for each purchase. Now we have a list
of all products that were sold for the
owner to review. COALESCE can save you time when you’re
making calculations too by skipping any null values and keeping your math correct. Those were just some of the advanced functions
you can use to clean your data and get it ready for the next step in the
analysis process. You’ll discover more as you
continue working in SQL. But that’s the end of this
video and this module. Great work. We’ve
covered a lot of ground. You learned the different data- cleaning functions
in spreadsheets and SQL and the benefits of using SQL to deal
with large datasets. We also added some SQL formulas and functions to your toolkit, and most importantly, we
got to experience some of the ways that SQL
can help you get data ready for your analysis. After this, you’ll get to spend some time learning how
to verify and report your cleaning results
so that your data is squeaky clean and your
stakeholders know it. But before that, you’ve got another weekly challenge to
tackle. You’ve got this. Some of these concepts might
seem challenging at first, but they’ll become
second nature to you as you progress
in your career. It just takes time and practice. Speaking of practice, feel
free to go back to any of these videos and rewatch or even try some of these
commands on your own. Good luck. I’ll see you
again when you’re ready.

Module 3 challenge

Reading: Glossary: Terms and definitions

Reading

Course-4-Module-3-Glossary-_-DA-terms-and-definitions Download

Quiz: Module 3 challenge

A data analyst is analyzing medical data for a health insurance company. The dataset contains billions of rows of data. Which of the following tools will handle the data most efficiently?

SQL

A team of data analysts is working on a large project that will take months to complete and contains a huge amount of data. They need to document their process and communicate with multiple databases. The team decides to use a SQL server as the main analysis tool for this project and SQL for the queries. What makes this the most efficient tool? Select all that apply.

SQL efficiently handles large amounts of data.
SQL records queries and changes throughout a project.
SQL allows you to connect to multiple databases.

A data analyst runs a SQL query to extract some data from a database for further analysis. How can the analyst save the data? Select all that apply.

Create a new table for the data.

Download the data as a spreadsheet

Use the UPDATE query to save the data.

You are working with a database table named invoice that contains invoice data. The table includes columns for customer_id and total. You want to remove duplicate customers and identify which unique customers have a total greater than 5.
You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the customer_id column.
NOTE: The three dots (…) indicate where to add the clause.

SELECT DISTINCT customer_id
FROM
invoice
WHERE total > 5

What customer_id number is located in row 5?

NOTE: The query index starts at 1 not 0.

The clause DISTINCT customer_id will remove duplicate entries from the customer_id column. The complete query is SELECT DISTINCT customer_id FROM invoice WHERE total > 5.

You are working with a database table named customer that contains customer data. The table includes columns about customer location such as city, state, country, and postal_code. You want to check for postal codes that are greater than 7 characters long.
You write the SQL query below. Add a LENGTH function that will return any postal_code that is greater than 7 characters long.
NOTE: The three dots (…) indicate where to add the clause.

SELECT
*
FROM
customer
WHERE LENGTH(postal_code) > 7

What is the last name of the customer that is in row 10 of your query result?

NOTE: The query index starts at 1 not 0.

Hughes

The function LENGTH(postal_code) > 7 will return any postal codes that are greater than 7 characters long. The complete query is SELECT * FROM customer WHERE LENGTH(postal_code) > 7. The LENGTH function counts the number of characters a string contains. Hughes is the last name of the customer that appears in row 10 of your query result.

In SQL databases, True/False values refers to what data type?

Boolean

Your current database consists of multiple tables. You need to join three tables in order to build your dataset. In each table you notice that Boolean columns are of the data type string and integer columns are of the data type float. What function can you use to convert these columns to the correct data type while joining your tables?

CAST

Fill in the blank: The _____ function can be used to return non-null values in a list.

COALESCE

You are working with a database table that contains invoice data. The table includes columns about billing location such as billing_city, billing_state, and billing_postal_code. You use the SUBSTR function to retrieve the first 4 numbers of each billing_postal_code, and use the AS command to store the result in a new column called new_postal_code.
You write the SQL query below. Add a statement to your SQL query that will retrieve the first 4 numbers of each billing postal code and store the result in a new column as new_postal_code.
NOTE: The three dots (…) indicate where to add the statement.
NOTE: SUBSTR takes in three arguments being column, starting_index, ending_index

SELECT
invoice_id,
SUBSTR(billing_city,1, 4) AS new_city
FROM
invoice
ORDER BY
billing_city

What invoice id is located in row 4?

NOTE: The query index starts at 1 not 0.

206

The statement SUBSTR(billing_postal_code, 1, 4) AS new_postal_code will retrieve the first 4 characters of each postal code and store the result in a new column as new_postal_code. The complete query is SELECT invoice_id, SUBSTR(billing_postal_code, 1, 4) AS new_postal_code FROM invoice ORDER BY billing_city. The SUBSTR function extracts a substring from a string. This function instructs the database to return 4 characters of each postal code, starting with the first character. The invoice ID number 206 is in row 4 of your query result.

Home » Google Career Certificates » Google Data Analytics Professional Certificate » Process Data from Dirty to Clean » Module 3: Cleaning data with SQL

Module 3: Cleaning data with SQL

Using SQL to clean data

Video: Using SQL to clean data

Video: Sally: For the love of SQL

Video: Understanding SQL capabilities

Reading: Using SQL as a junior data analyst

Video: Spreadsheets versus SQL

Reading: SQL dialects and their uses

More information

Practice Quiz: Hands-On Activity: Processing time with SQL

Practice Quiz: Test your knowledge on SQL

Learn basic SQL queries

Reading: Optional: Upload the customer dataset to BigQuery

Video: Widely used SQL queries

Video: Evan: Having fun with SQL

Video: Cleaning string variables using SQL

Practice Quiz: Hands-On Activity: Clean data using SQL

Practice Quiz: Test your knowledge on SQL queries

Transforming data

Reading: Optional: Upload the store transactions dataset to BigQuery

Video: Advanced data-cleaning functions, part 1

Video: Advanced data-cleaning functions, part 2

Module 3 challenge

Reading: Glossary: Terms and definitions

Quiz: Module 3 challenge

Share this:

Like this: