Module 3: Clean your data

You’ll explore three more EDA practices: cleaning, joining, and validating. You’ll discover the importance of these practices for data analysis, and you’ll use Python to clean, validate, and join data.

Learning Objectives

Apply input validation skills to a dataset with Python
Explain the importance of input validation
Demonstrate how to transform categorical data into numerical data with Python
Explain the importance of categorical versus numerical data in a dataset
Explain the importance of recognizing outliers in a dataset
Demonstrate how to identify outliers in a dataset with Python
Understand when to contact stakeholders or engineers regarding missing values
Explain the importance of ethically considering missing values
Demonstrate how to identify missing data with Python

Table Of Contents

The challenge of missing or duplicate data
The ins and outs of data outliers
Change categorical data to numerical data
Input validation
Review: Clean your data

The challenge of missing or duplicate data

Video: Welcome to module 3

Notes

Transcript

Topic: Exploratory Data Analysis (EDA) – Cleaning, Joining, and Validating Data

Key Points:

EDA practices go beyond data discovery & storytelling to include cleaning, joining, and validating data.
This part focuses on missing values, outliers, categorical data transformation, and input validation.
Python notebooks and job-relevant datasets will be used for hands-on practice.
Communication skills are crucial: when to talk to stakeholders/engineers about data issues and consider ethical implications.
These practices, along with effective communication and Python coding, prepare data for further analysis.
Following PACE’s 6 EDA practices saves time, improves data storytelling, and prioritizes analysis effectively.
Real-world example: analysis paralysis vs. focusing on major user issues and clear actions.

Next Steps:

Learn about common data cleaning obstacles and how to tackle them.

Remember:

Cleaning, joining, and validating data are essential components of effective EDA.
Communication and ethical considerations are crucial when dealing with data issues.
Prioritize analysis based on impact and clear action recommendations.

Let’s imagine you work at
a board game manufacturer. As the quality
assurance manager, your job is to make sure every game is put
together correctly. One day you discover that the machine that produces
the games is having technical issues causing
misprinted cards and incorrect counts
of game pieces. Your manager asks you and the rest of the quality
assurance team to go through the affected
boxes and replace misprinted cards
and missing pieces. Your goal is to make
sure each box is usable. Searching through game
boxes for missing an incorrect game pieces
is similar to cleaning, joining, and validating
your data sets as part of exploratory data
analysis or EDA. Earlier, you learned how to
discover and structure data in order to understand and tell it stories in impactful ways. In this part of the
course, we will cover three of the other
practices of EDA: cleaning, joining, and validating data. Although there are
many different ways to clean join and validate data we will focus on
missing values and outliers, the need for transforming
categorical into numerical data and the
importance of input validation. Of course, we won’t just talk about the concepts of
cleaning, joining, and validating, you will learn how to apply them in
a Python notebook setting. We’ll use data sets that are comparable to what you
will see on the job. Along the way, we’ll go
through some important tips for improving your workplace
skills like communication. We will discuss when
to communicate with stakeholders and engineers about missing or outlier
values and about the ethical
implications you must consider when dealing with missing an outlier data values. These practices of EDA along with effective
communication and Python coding are
all essential to preparing a data set for the next steps in
the pace workflow. By following pace while performing the six
practices of EDA you will not only save time and energy on future processes, but you will be
more effective in finding and telling
the data story. One of my first
jobs at Google was a quantitative analyst
on Google Translate. It was my responsibility to help identify areas that we can
improve to help users. I recall a project where I was overly eager to
dive into the data. I spent countless hours
identifying opportunities ranging from when a user first downloads the app
to their second, third, and 20th time
using the product. I was proud of my
findings and I developed a long presentation to
share with my executives. Now while my insights
were well-received, the team was reluctant to invest in any of my ideas
because there were so many and they weren’t sure
which ones to prioritize. I realized after that meeting
that I needed a plan. I’d spent so much time mining
for insights that I was suffering from what my manager
called analysis paralysis. Instead, what I needed
to do was find out what issues were affecting the most number of users
and the severity of it. By organizing my thoughts
using this strategy, I was then able to
hone my analysis into one specific issue and then recommend clear actions to help. Like your job as a quality assurance
manager that we imagined at the
beginning of this video. As a data professional, you may be responsible
for cleaning up data sets so that
they’re ready to use. You won’t always be sure what obstacles you may
face with cleaning data, so let’s get started with
some of the most common.

Video: Methods for handling missing data

Notes

Tutorial

Quiz

Transcript

Summary of Missing Data in Exploratory Data Analysis (EDA):

Challenges:

Missing data (N/A, NaN, blank) can impact data analysis and conclusions.
Reasons for missing data vary: computer errors, forgotten input, etc.
Impact varies, from negligible to substantial, affecting communication with stakeholders.

Example: Sleep habits survey with missing responses; inconclusive results due to 67% missing data.

Communication:

Inform stakeholders if missing data prevents analysis completion and suggest solutions.
Be mindful of ethical considerations when dealing with missing data.

Handling Missing Data:

Request data fill-in: Best if feasible, like sending follow-up surveys.
Delete missing data: Ideal for low percentages and non-skewed data.
Create NaN category: Good for missing categorical data, like “answer not recorded”.
Fill in representative values: Useful for forecasts, with methods like forward/backward filling, mean/median imputation.

Choosing a Method:

Based on experience, intuition, reasoning, and dataset specifics.
Consult peers, managers, and stakeholders depending on impact and ethical considerations.

Key Takeaway:

Be thoughtful and intentional in handling missing data, considering quantity, impact, and ethics. Develop a strategy and plan as in the board game scenario.

Understanding Missing Data and Its Challenges:

Definition: Missing data refers to values that are not stored for a variable in a dataset, often represented as N/A, NaN, or blank. It’s distinct from a data point of zero.
Prevalence and Impact: Missing data is common and can significantly impact analysis and conclusions if not handled properly.
Reasons: Missing data can occur due to various reasons, including computer errors, data entry mistakes, survey non-responses, or values not applicable to certain cases.
Ethical Considerations: Be mindful of potential biases and assumptions when dealing with missing data.

Identifying and Quantifying Missing Data:

Visualization Techniques: Use histograms, box plots, and scatter plots to visualize missing values and their patterns.
Statistical Measures: Calculate the percentage of missing values for each variable to assess the extent of missingness.

Handling Missing Data Strategies:

Request Data Fill-in:
- Contact data owners or conduct follow-up surveys to retrieve missing values.
- Ideal when feasible and data collection is recent.
Delete Missing Data:
- Remove rows or columns with missing values.
- Suitable for low percentages of missingness and when deletion doesn’t introduce bias.
Create Missing Data Category:
- Assign a unique category (e.g., “Missing”) for missing values in categorical variables.
- Preserves information about missingness.
Fill in Representative Values (Imputation):
- Replace missing values with estimates based on existing data.
- Common methods:
  - Mean imputation: Replace with mean of the variable.
  - Median imputation: Replace with median of the variable.
  - Forward filling: Replace with previous non-missing value.
  - Backward filling: Replace with next non-missing value.
  - Predictive modeling: Use regression or machine learning to predict missing values.

Choosing the Best Strategy:

Consider the amount and pattern of missing data, the nature of variables, analysis goals, and ethical implications.
No one-size-fits-all solution; experimentation and evaluation are often necessary.

Additional Tips:

Communicate with stakeholders about missing data and its implications for analysis.
Document your missing data handling approach for transparency and reproducibility.
Consider advanced techniques like multiple imputation for complex scenarios.

Staying Informed:

Keep up-to-date with best practices and emerging techniques for handling missing data.
The field of missing data analysis is constantly evolving.

Data encoded as N/A, NaN, or a blank is defined as zero.

False

Data encoded as N/A, NaN, or a blank is defined as a value that is not stored for a variable in a dataset. This is different from a data point of zero, which may be a missing value or a legitimate data point.

Remember earlier when you imagined
yourself as a quality assurance manager at a company assembling a board game. When you discovered that the machine that
produces the games was having technical issues and
causing game boxes to have pieces missing. You and your company could have chosen
a different number of paths forward. One, you could have thrown
away the impacted game boxes. Two, you could have left them as is and
sold them at a discounted rate. Or three, and
the one you chose in the example, you rummage through the boxes and
corrected the mistakes. There was a plan, strategy, or
protocol for dealing with missing pieces. The same should be true for any missing
values you find in the data set. When you have missing entries in your data
set, you have to decide on a plan for dealing with them. Even more than that, you must be ready
to communicate with stakeholders if missing values impact your analysis. Missing data, which is often
encoded as N/A, NaN or blank, is defined as a value that is not
stored for a variable in a set of data. This is different from
a data point of zero. I’ll explain more on that in a bit. Data professionals often experience
the challenge of missing data. Every data set is different. So, there are a wide variety of
reasons for values to be missing. Everything from a computer
error during a data upload, to someone simply forgetting to input it. Depending on the number of
entries in your data set and the quantity of missing values. The impact of data fields
with a not a number or NaN’s can range from
negligible to substantial. The impact will also determine how
you communicate to stakeholders or clients, ranging from a note at
the bottom of a data visualization. About the fact that there’s missing data,
to a face to face meeting about how analysis cannot be completed due
to the large number of missing values. Here’s an example of the impact
missing data has on a data set and the ethical questions
missing data creates. Imagine you’re a data professional
trying to learn more about sleep habits. You email questionnaire to, say,
100 people and all of them completed. One of the questions asked, do you
sleep exactly eight hours each night? The answer choices are yes, no, or unsure. 9 people answered yes, 9 answered no,
and 15 answered unsure. Because only 33 people responded to
that question, it’s hard to draw a definitive conclusion about
the sleep habits of all 100 students. Since 100 people completed
the questionnaire and only 33 responded to this question, 67% of
the replies would be considered missing. These missing values greatly impact
the questions resulting data. It would be unwise for a data professional
to try to use the 33% response rate in order to draw conclusions
about the entire population. If you can encounter this level of
missing values while working as a data professional, you should communicate
the inability to complete any analysis and suggest possible solutions. Another challenge that can come up
during EDA, is with the value zero. In some data sets, a zero could
be considered a missing value. But in other data sets could
be a legitimate data point. In some data sets, a NaN or
blank space might be a mistake. Someone forgot to fill it in. Or the data may have been left
blank because the column may not apply to that data point. Data professionals unsure of whether or
not the blank spaces intentional, should ask a stakeholder or the owner of
the data to confirm the spaces legitimacy. Whenever you find missing data, you have
a choice to make on how to handle it. As a data professional, you should
consider how the missing data might impact stakeholders and
who should be made aware. Here are four common ways
to handle missing data. One, you requested the missing values
be filled in by the owner of the data. Two, you delete the missing columns,
rows, or values which would work best if the total
count of missing data is relatively low. Three, you create a NaN category. Or four, you derive new representative
values such as taking the median or average of the values that aren’t missing. First, we’ll talk about
filling in the missing data. If there are large quantities of
missing values in the data you receive, the best method to handle the missing data
would be to contact the owners of the data and request that that data be filled in. For example, in our sleep example,
you could send a follow up request for people to log a response to
that particular question. Or you could rephrase the question,
how many hours do you sleep each night? However, in many data sets, you see while
working as a data professional, the option to get missing data filled in may not
be feasible for a variety of reasons. For example, if you’re working on
a project that is time sensitive, you might not have time to gather
more data before your deadline. Or the data cannot be retrieved because
the study happened too long ago. If filling in missing
data isn’t an option, you can also choose the option
of deleting the missing data. Though some might assume it’s
not appropriate to delete data, removing rows or
whole columns of data is ideal. If there isn’t a large
percentage of NaN’s or the values are not going to impact
the business plan for the data. One thing to watch out for though, is discarding missing data
that is not missing at random. Deleting values that have been
left blank intentionally, can skew the results of your analysis. For example, in the sleep questionnaire,
if you were to remove all the people that left that question blank,
you delete a majority of the data. A third option,
is to make the NaN their own category. Which is a good strategy if the missing
data itself is categorical rather than numerical. For example, in the sleep questionnaire,
if you were to put all the non responses into their own
category called answer not recorded, you would be creating a category for
the missing data. Finally, there is the strategy of
filling in the missing data by creating a representative value. This strategy can be more useful
with business plans that call for a predicted value or forecast. There are multiple methods within this
filling in strategy to choose from. Including the four most common,
forward filling, backward filling, deriving mean values, and
deriving median values. We will talk about how to do all these
missing data operations using python in another video. These are the most common methods for
handling missing data. Choosing which method to use comes with
experience, intuition, and reasoning. Sometimes you’ll have the opportunity
to use more than one option with the data set. Every dataset and
project plan will be different. So, you will be making this determination
each time you encounter new data. Sometimes, you’ll want to confer
with peers, or managers, or stakeholders in order to make a decision. The impact missing values
has on your analysis, should be the determining factor
in whom you should contact. As a data professional, it is
essential that you are thoughtful and intentional about how
missing data are addressed. Consider the quantity of NaNs and their importance in relation
to the project plan. Ask yourself, how will this
approach impact this data set? And what are the ethical considerations? Just like the board game scenario, it is essential to determine
a strategy and decide on a plan.

Reading: Data deduplication with Python

Reading

As you’ve learned, the data cleaning and validating practices include several different steps, including handling missing data, outliers, and label encoding; checking for misspellings; and, handling duplicates. As a data professional, it will be your task to know how best to handle data values in those categories. In this reading, you’ll learn more about handling duplicates. You will also learn to identify and decide whether deduplication is the right strategy for a dataset. In addition, you will learn some common Python functions for handling duplicates.

Identifying duplicates

Before we make any decisions on whether to remove duplicate values or not, we should first determine if duplicate values are present in our dataset.

A simple way to identify duplicates is to use the duplicated() function from Pandas. duplicated() is a method of the DataFrame class.

This function returns a series of “true/false” outputs, with “true” indicating the data value is a duplicate, and “false” indicating it is a unique value.

Here’s an example of a five-row dataframe:

df

     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

Using the duplicated() function, the result is that one has been marked “True,” indicating it is a duplicate.

print(df)
print()
print(df.duplicated())

     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

0    False
1     True
2    False
3    False
4    False
dtype: bool

Identifying duplicates for an entire dataframe will be different than a single column or index. Be sure when you use the duplicated() function for an entire dataframe. The duplicated() function will only return entire rows that have exactly matching values, not just individual matching values found within a column. If you wish to identify duplicates for only one column or a series of columns within a dataframe, you will need to include that in the “subset” portion of the argument field of the duplicated() function. Going further, if you’d like to specify which of the duplicates to keep as the “original” as opposed to the duplicate, you can specify that in the keep portion of the argument field.

Below is an example of identifying duplicates in only one column (subset) of values and labeling the last duplicates as “false,” so that they are “kept”:

print(df)
print()
print(df.duplicated(subset=['type'], keep='last'))

    color  rating     type
0   olive     9.0    rinds
1   olive     9.0    rinds
2    gray     4.5  pellets
3  salmon    11.0  pellets
4  salmon     7.0  pellets

0     True
1    False
2     True
3     True
4    False
dtype: bool

Decision time: To drop or not to drop?

As you’ve learned, every dataset is unique and you cannot treat every dataset the same. When you are making the decision on whether to eliminate duplicate values or not, think deeply about the dataset itself and about the objective you wish to achieve. What impact will dropping duplicates have on your dataset and your objective?

1. Deciding to drop

You should drop or eliminate duplicate values if duplicate values are clearly mistakes or will misrepresent the remaining unique values in the dataset.

For example, you can be reasonably sure that a data professional will (in most cases) eliminate duplicate values of a dataset containing house addresses and house prices. Counting the same house twice will (in most cases) misrepresent any conclusions drawn from the dataset as a whole, such as average house price, total house price, or even total number of houses. In a case like this, a data professional would almost certainly eliminate the duplicate data so as to fairly represent the remaining data during analysis and visualization.

2. Deciding to NOT drop

You should keep duplicated data in your dataset if the duplicate values are clearly not mistakes and should be taken into account when representing the dataset as a whole.

For example, a dataset marking the number of throws and distances of an Olympic shot-put athlete in training will likely include several duplicate distances; just by nature of number of attempts and the limits a person can have a weighted ball, there will be duplicate values—particularly if the distance measurements are labeled to only 1 or 2 decimal places. In a case like this, a data professional would almost certainly keep all of the data to fairly represent it as a whole during analysis and visualization.

Don’t be duped — How to do deduplication

Before we get back into Python and learn how to eliminate duplicates, let’s first define the term “deduplication”:

Deduplication: The elimination or removal of matching data values in a dataset.

There are a number of different libraries, functions, and methods in Python you could use to remove matching data values.

One of the more common functions to use is in Pandas: drop_duplicates()

drop_duplicates() is another DataFrame method. It’s used to create a new dataframe with all of the duplicate rows removed.

For example, use a dataframe from earlier in this reading:

df

     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

Now apply the drop duplicates function:

df.drop_duplicates()

     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

You’ll notice in the resulting output that the duplicate row of data was removed, leaving the remaining unique values intact.

Note: Keep in mind that the drop_duplicates() function as written above will only drop duplicates of exact matches of entire rows of data. If you wish to drop duplicates within a single column, you will need to specify which columns to check for duplicates using the subset keyword argument.

This example drops all rows that have duplicate values in the style column (except for the first occurrence):

print(df)
df = df.drop_duplicates(subset='style')

print()
print(df)

     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3

And this example drops all rows (except the first occurrence) that have duplicate values in both the style and rating columns:

print(df)
df = df.drop_duplicates(subset=['style', 'rating'])

print()
print(df)

     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

Key Takeaways

Identifying duplicate data values in a dataset is an important part of EDA (or “Exploratory Data Analysis”) practices, specifically cleaning and validating. After identifying duplicates, think about the impact to the dataset and your analysis objective when choosing to eliminate duplicates or not eliminate duplicates.

Additional Resources

Want to learn more about duplicates and deduplication? Check out the following additional links.

Lab: Annotated follow-along guide: Work with missing data in a Python notebook

Reading

Annotated-follow-along-guide_-Dealing-with-missing-data-in-Python Download

Video: Work with missing data in a Python notebook

Notes

Tutorial

Notes

Transcript

Identifying and Analyzing Missing Data in NOAA Lightning Strike Data

This video focuses on identifying and analyzing missing data in two NOAA lightning strike datasets for August 2018. Here’s a summary:

Data Sets:

df: Contains columns like date, latitude, longitude, and number of strikes.
df_zip: Includes additional columns like zip code, city, state, and state code.

Identifying Missing Data:

Merging the dataframes reveals missing values in “zip code,” “city,” “state,” and “state code” columns of df_zip.
Using pd.isnull.all and the shape function with the merged dataframe gives the total number of missing entries: 393,830.
The info function shows the “non-null count” for each column, confirming which ones have missing data.

Analyzing Missing Data:

A new dataframe, df_null_geo, is created to focus on missing values.
A map visualization using plotly express scatter geo reveals that missing values cluster near water bodies like lakes and the ocean.

Key Takeaways:

Missing data are likely due to no state abbreviations or zip codes for bodies of water.
Data cleaning can uncover hidden stories from the data, like the relationship between missing values and water locations.

Additional Notes:

The video emphasizes the importance of data cleaning and how it can lead to new insights.
Tips for optimizing visualizations with large datasets are mentioned, like using the express package.

By following these steps and understanding the context, you can effectively identify and analyze missing data in your own datasets to gain valuable insights.

Identifying and Analyzing Missing Data in NOAA Lightning Strike Data: A Tutorial

This tutorial explores the identification and analysis of missing data within two NOAA lightning strike datasets for August 2018. By working through these steps, you’ll gain valuable insight into handling missing data in your own projects.

Data Sets:

df: Contains basic information like date, latitude, longitude, and number of strikes.
df_zip: Extends df with additional details like zip code, city, state, and state code.

Identifying Missing Data:

Merging Dataframes: Combine df and df_zip using pandas’ merge function.
Locating Missing Values: Analyze the merged dataframe to identify missing entries in columns like “zip code,” “city,” “state,” and “state code” (likely absent for water bodies).
Quantifying Missing Data:
- Use pd.isnull.all and the shape function to find the total number of missing entries (e.g., 393,830).
- Utilize the info function to examine the “non-null count” for each column, confirming which ones have missing data.

Analyzing Missing Data:

Focus on Missing Values: Create a new dataframe, df_null_geo, containing only relevant columns and the missing values (e.g., latitude, longitude, “number of strikes y” with missing zip code data).
Visualization with Plotly Express:
- Import the express package from plotly for efficient visualization.
- Use plotly express scatter geo to create a map highlighting the locations of missing data points.
- Filter the data to focus on areas with significant lightning strikes (e.g., > 300).
- Size the data points based on the “number of strikes y” values.
- Customize the plot with a title and appropriate geographic scope (e.g., geo_scope='usa').

Key Takeaways:

Missing data in “zip code,” “city,” and “state code” columns likely occur for locations like water bodies, which lack these designations.
Data cleaning and analysis can reveal hidden insights, such as the association between missing values and water locations.
Optimizing visualizations with large datasets is crucial (e.g., using express package).

Further Exploration:

Experiment with different data cleaning techniques to handle missing values based on your specific needs and analysis goals.
Investigate the potential impact of missing data on your analysis and conclusions.
Consider alternative visualization methods to effectively represent the distribution and characteristics of missing data.

By following these steps and applying them to your own datasets, you can gain a deeper understanding of missing data and its implications for your analysis, leading to more robust and reliable conclusions.

Remember, data cleaning and analysis are essential steps in any data-driven endeavor, and effectively handling missing data is key to drawing accurate and meaningful insights from your information.

What is indicated by the term null?

The data is missing.

The term null indicates that the data is missing.

Now that we reviewed the importance of
having a plan to address missing data. Let’s talk about how to identify and decide how to work with missing
data in a Python notebook. We’ll use the data you’re familiar with, the group of NOAA
lightning strike data sets. The goal of this video is to learn
how to identify missing data. You’ll be provided two different
slices of data from August 2018. One slice will have the columns date,
center point geom, latitude, longitude,
and number of strikes. The second slice of the data will include
the same columns as the first slice, along with the zip code,
state, city, and state code. You’ll learn through comparisons of these
two data sets, how to find missing data. Before we look for any missing values, we need to import some Python
libraries and packages. We’ll use pandas, NumPy, seaborn, datetime, and pyplot from matplotlib. We will begin by looking at only
the number of lightning strikes for August 2018. Using less data in a Python
notebook allows you to spend more time coding and
less time waiting for the code to run. For this exercise, our first data set will have two
additional columns called latitude and longitude which were points pulled
from the center point geom column. Let’s explore the column pandas and
the overall size of this first data set using the functions that you’re
already familiar with, head and shape. We’ll find the two additional
columns we just mentioned for latitude and longitude as well as
a shape of 717,530 rows and 5 columns. Let’s put this aside as our first
data frame saved as df.shape. Next, we’ll use the second data set for
August 2018. This one has columns for
zip code which is a postcode delivery area number in the US,
as well as city, state, and a column titled state code that has two
letter abbreviation of states in the US. We’ll call the second data frame df zip, so we don’t get confused with the first data set we created earlier. Let’s run the same two functions head and shape on df _zip. The head function returns what we expected out of the data set. The data columns, zip code, city, state, state code center point geom, and number of strikes. The shape function, however, returned 323,700 rows and 7 columns. Given that this data frame is also from August 2018, we expected to see the same number of rows as the first data frame, 717,530. To further explore this, let’s put the two data frames together with the merge function. We’ll need to create a new data frame. We’ll call it df _joined = df .merge. This df .merge indicates that we want to merge the first data frame we saved named df with another data frame. For the merge put df _zip in the argument field, which is the data frame we want to merge with df. Next, fill in the parameters for the merge, including how= ‘ left ‘ ,. Finally, input the last parameter on= ‘date’, center _point _geom ‘. These two parameters how and on, tell Python which way to join the data. When we run the head function on our new data frame we find NaNs listed for the columns zip code, city, state, state code, and number of strikes y. Now, that the data is merged, let’s use a basic function we learned in a previous lesson to search for missing data, describe. Input, df _joined.describe( ) and leave the argument field blank. We run the cell, we find that the total count of lightning strikes has been divided into two, number of strikes x and number of strikes y. Once we merge the two data frames, pandas automatically separated number of strikes into data entries with full data rows and data entries that have missing entries. We can find the total amount of data that’s missing using this next piece of code. We’ll create a data frame called df _null _geo by taking our joined data frame df _joined and then adding the pandas function pd.isnull all one word. The term null indicates that the data is missing. This function pulls all of the missing values from the df _joined data frame. We already know which of the columns have missing values. Since we’re interested in finding the total, we can use any one of those columns. For this code, we selected state_code. After that, we’ll use the shape function to give us the total rows and columns, which comes to 393,830 and 10 respectively. For even more detail, use the info function which we used in another video. When we input df _joined.info( ) with the argument field left blank. We get the column names and another column called non_null count. Non-null count is the total number of data entries for a data column that are not blank. A quick check of the non-null count column helps us confirm which columns have missing data, zip code, city, state, and state code. In fact, if we subtract the total number of strikes x and the number of strikes y the result is the number we identified earlier, 393,830. Let’s take a look at the top portion of the new data frame we created, df _null geo.head using
the head function. The output tells us exactly what
we expected to find in the data frame we created. In the top five lines of data,
the columns zip code, city, state, state code and number of strikes y have
the NaN values in the cells. Now that we have pinpointed exactly what
values are missing in our data frame. The last thing we’ll do in this notebook
is learn how these missing values impact our data. The best way to do that is to
create a data visualization. A plotted map will help us see where
the majority of the missing values are located geographically. To design the map, let’s start
by creating another data frame. This one we’ll call top_missing. With this data frame we
are gathering only the data coms we will need in order to plot
a geographic visualization, latitude, longitude, and
number of strikes x. As you’ll recall from
earlier in this video, number of strikes x
includes all of the 717,530 data rows while number of
strikes y are a segment of those 717,530 with missing data in the zip code, state, and state code columns. You’ll find that it’s helpful to group the
columns by latitude first then longitude. This will make it easier for
most data processors to plot. Lastly, we’ll sort the lightning
strikes x column by some of values. In the ascending field will input,
False, because we want the largest sums of lightning strikes
at the top of the data frame. Finally, let’s take a look at what we
built using the head function again. After running the cell,
the first 10 rows show the most strikes are falling on and
around the same latitudes and longitudes. Let me give you a helpful hint before
we plot a visualization using this data frame. You should import the express
package from plotly first. Express is a helpful package that
speeds up coding by doing a lot of the back-end work for you. If we don’t use express for this
particular data set which has hundreds of thousands of points to plot your
run cell times could be long or the code could even break. Now, it’s time to create the map. Our format for the graph will be from
plotly express called scatter geo. As indicated by its name,
this graph type is used for scatter plots on a geographic map. In the argument field will input the data
frame we just created, top missing. For the parameters in the argument field, we want to filter the number of
values to be only those latitudes and longitudes with lightning
strikes of more than 300. Naturally, we plug latitude and longitude
into their appropriate parameter spots. Then for the size parameter, we plug in
the final column, number of strikes x. Lastly, let’s use the update
layout function within the figure to give the plot title,
Missing data. Now, we create it. It’s a nice geographic visualization but
we really don’t need the global scale. Let’s scale it down to only the geographic
area that we are interested in, the United States. So, let’s copy the same
code we just wrote, this time though we add
the parameter geo_scope= ‘usa’ ,. This will limit the geographic scope
to just the United States of America. Then we can plot it again and
check the output. The resulting map shows a majority
of these missing values cropping up along the borders or
in spots over bodies of water like lakes, the ocean, or the Gulf of Mexico. Given that the missing data were
in the state abbreviations and zip code columns, it does make sense
why those data points were left blank. There are no zip codes for
bodies of water. You’ll also notice some other locations
with missing data that are not over bodies of water. These types of missing data,
latitude and longitude on land are the kind of missing data you would
want to reach out to the NOA A about. What’s interesting about cleaning
the lightning strike data and looking for missing data values is that we
learn something new in the process. We found one of the stories
hidden in the data. The data values were missing because
the lightning strikes were over water. It goes to show you that data
cleaning may sound tedious but you’ll never know what you learn.

Video: Remy: A day in the life of a data professional

Notes

Transcript

Summary of Remy’s Customer Engineer Story:

Remy:

A Google Cloud customer engineer specializing in data analytics and machine learning.
Previously had many diverse jobs but found her niche in data science.
Values the automation and problem-solving aspects of data science.

Typical Day:

Answers customer questions via email and meetings.
Stays updated on new features and techniques.
Helps customers strategize and leverage new tools like BigQuery ML.
Troubleshoots any technical issues customers encounter.

Advice:

Start small and focus on mastering one software, language, or area.
Embrace the constant learning and growth in the dynamic field of data science.

Key Takeaways:

Customer engineers serve as technical experts and consultants for clients.
Data science offers automation and intellectual challenges for those who thrive on them.
Continuous learning and specialization are crucial for success in this field.

Remy’s story provides a glimpse into the exciting and ever-evolving world of data science and customer engineering, highlighting the importance of passion, adaptability, and expertise.

Hi, my name is Remy. I’m a customer engineer
for Google Cloud. A customer engineer is
someone who focuses in some technical field, for me, it’s data analytics
and machine learning, and is the expert in
that field and helps their customer answer
questions and build things related to
data analytics. I had no exaggeration around 30 different jobs before I got into the field
that I’m currently in, I was a cashier, a hostess, a cocktail waitress,
an office assistant. Quite frankly, I wasn’t very
good at any of those jobs. If somethings like repetitive, I can’t concentrate
and then I end up making silly mistakes. That’s what’s so great
about data science is a lot of the minutiae
and repetitive tasks, software does it for you or you can write a script to do it. There is definitely
no typical day in the life of a
customer engineer. Usually I will start the
day by looking at emails. I’ll usually have some
questions from a customer and other colleague related to
data that I can help answer. There’s a lot of
informational emails about new features, products,
or techniques. I have to read those and learn about them because
it’s my job to be the person who
knows about all of these things for our customers. I’ll also typically have
a few meetings where I’m working with our account
and customer teams where we try to strategize how
we are going to help customers and get them to do
new things with our tools. A new feature that I’ve
been working with is related to our data
warehouse BigQuery. You can actually create
machine learning models within BigQuery so you don’t have to move the data anywhere. I have a variety
of customers that use BigQuery Machine Learning. So I can now come to
them and share with them this new feature and show them how
it actually works. If while they’re using it, they might have an
issue that comes up, they get some error, I can help troubleshoot
that problem for them. There’s so much going on in this field of data
and it’s also new. It’s going to feel overwhelming. I’ve been doing it for
almost a decade and I still get overwhelmed
on a daily basis. The best thing you can
do is pick a software, pick a language, maybe
a specific area, and just focus on that. Get really good at one
thing and then you can expand from there.

Lab: Exemplar: Address missing data

Reading

Exemplar_Address-missing-data Download

Practice Quiz: Test your knowledge: The challenge of missing or duplicate data

Fill in the blank: Missing data has a value that is not stored for a _____ in a dataset.

variable

Missing data has a value that is not stored for a variable in a dataset. It is typically encoded as N/A, NaN, or a blank.

A data professional requests additional information from a dataset’s original owner. Unfortunately, they are not able to provide the information. Therefore, the data professional creates a NaN category in the dataset. What concept does this scenario describe?

Solving the problem of missing data

This scenario describes solving the problem of missing data. There are four common ways to do this: Request the missing values from the owner of the data; delete the missing columns, rows, or values; create a NaN category; or derive new representative values.

When merging data, a data professional uses the following code:
df_joined = df.merge(df_zip, how=’left’, on=[‘date’,’center_point_geom’])
What is the function of the parameters how and on in this code?

To tell Python which way to join the data and which column to join from

The parameters how and on tell Python which way to join the data and which column to join from. How tells Python which way to join the data, and on tells Python which column to start from.

Non-null count is the total number of blank data entries within a data column.

False

Non-null count is the total number of data entries for a data column that are not blank.

The ins and outs of data outliers

Video: Account for outliers

Notes

Tutorial

Quiz

Transcript

This video covers the importance of recognizing and dealing with outliers in data analysis.

Outliers:

Extreme data points significantly different from others.
Can skew conclusions and models.
Three main types:
- Global: Obvious discrepancies with no association to other data points.
- Contextual: Normal under specific conditions but anomalies otherwise.
- Collective: Group of abnormal points following similar patterns.

Identification:

Visualization: Look for dips, blips, or isolated clusters in graphs.
EDA: Develop a strategy for dealing with outliers.

Decision-making:

Consider context and business plan when deciding how to handle outliers.
Removing them may alter the data story.
Ethically analyze before dismissing outliers.

Example:

Retail data shows a drop in sales of a popular item.
Investigation reveals two salespeople on leave and a new product launch.
This information, not discarding the data, benefits the business.

Further learning:

Finding and working with outliers in Python.

This video emphasizes the importance of careful outlier analysis, considering both technical and ethical aspects, to get the most accurate insights from data.

Here’s a tutorial on outliers in data analysis:

Outliers: Understanding the Unexpected

What are Outliers?

Outliers are data points that significantly differ from the other observations in a dataset, often deviating from the overall pattern or trend.
They can arise due to various reasons, including:
- Measurement errors or data entry mistakes
- Natural variations within a population
- True outliers that represent unique or exceptional cases

Types of Outliers:

Global Outliers:
- Values that are far away from the overall distribution of the data.
- Often considered errors or anomalies and may be removed for analysis.
Contextual Outliers:
- Values that are unusual within a specific context or condition.
- They might be valid data points but need careful interpretation.
Collective Outliers:
- Groups of data points that deviate together from the rest of the dataset.
- May indicate a distinct subgroup or a different underlying process.

Why Outliers Matter:

Outliers can significantly impact statistical analyses and model building.
They can distort measures of central tendency (mean, median) and variability (standard deviation, range).
They can influence model fitting and prediction accuracy, leading to misleading results.

Detecting Outliers:

Visualization Techniques:
- Box plots
- Scatter plots
- Histograms
Statistical Methods:
- Z-scores
- IQR (Interquartile Range) method

Handling Outliers:

Removal:
- Consider removing outliers if they are clearly errors or have a strong negative impact on analysis.
- Use caution and transparency when removing data points.
Transformation:
- Apply mathematical transformations (e.g., log transformation) to reduce the impact of outliers.
Modeling:
- Use robust statistical methods that are less sensitive to outliers.

Ethical Considerations:

Carefully consider the reasons behind outliers and their implications before removing them.
Avoid removing outliers simply to improve results or fit a desired model.
Document any outlier removal or treatment decisions for transparency.

Remember:

Outliers can provide valuable insights into data patterns and underlying processes.
Treat them with caution and consider their context before making decisions.
Employ appropriate methods to detect, analyze, and handle outliers ethically to ensure accurate and reliable data analysis.

Fill in the blank: Outliers are observations that are an _____ distance from other values.

abnormal

Outliers are observations that are an abnormal distance from other values. They may also be observations that are abnormal compared to the overall pattern of the data population.

As a child, did you ever play a puzzle game
in which you are supposed to look
at a picture and find an object that
doesn’t belong? Maybe sometimes the
object was obvious, but other times, it may have been more difficult. For example, if you’re
given a picture of a bustling marketplace full of people selling
fruits, vegetables, and grains, it probably
takes some time to find the item for sale that’s
not in the right place. Similar to visiting
a busy market. When you encounter a dataset
that hasn’t been cleaned, it can be really challenging. Data may at first
seem structured, but it often can be hard
to determine whether there are any data points that are measurably different
from others. These extreme data observations, the ones that stand
out from others, are known as outliers. Outliers are observations that are an abnormal distance from other values or an overall
pattern in a data population. In this video, we will discuss the three main types
of outliers and why it is important for
data professionals to identify them in the
data we analyze. As a data professional, you should be fully aware
of the beginnings and ends, highs and lows, and extreme points of your data across every variable
in the dataset. These values will often
be your outliers. It is essential that
you recognize them, what their value is,
and where they are. Otherwise, they will likely skew any conclusions you draw or
models you build from them. There are three different
types of outliers we will discuss in
this video: global, contextual, and
collective outliers. Later, you’ll learn how to find outliers and work
with them in Python. The first outliers will discuss tend to be the easiest
data points to detect. Global outliers are values that are completely
different from the overall data group and have no association with
any other outliers. They may be inaccuracies,
typographical errors, or just extreme values you typically don’t
see in a dataset. For example, if we had
a set of human heights, 1.7 meters, 1.9
meters, 1.6 meters, 1.8 meters, 7.9 meters, 1.8 meters, or 1.7 meters, the outlier is fairly obvious. Typically, global
outliers should be thrown out to create
a predictive model. Contextual outliers can
be trickier to spot. Contextual outliers
are normal data points under certain conditions but become anomalies under
most other conditions. As an example, movie
sales are expected to be much larger when a
film is first released. If there is a huge spike
in sales a decade later, that would typically
be considered abnormal or a
contextual outlier. These outliers are more
common in time-series data. Another example might
be an outlier only in a specific single
category of data. For instance, 2.5 meters may be a normal enough size for a category called
mammal heights, but if the mammal is
found out to be a mouse, we would likely consider
this an outlier despite it fitting in with
other mammal heights. Lastly, we have
collective outliers. Collective outliers are a group of abnormal points that follow similar patterns and are isolated from the rest
of the population. Think of a parking
lot at a store. It is not uncommon to
have cars or scooters coming and leaving consistently during the hours
a store is open, but to have a full parking lot after the store is closed would be considered a
collective outlier. It could be that there is a company party or a
local event nearby, which would explain
the outlier of cars parked in the store
parking lot after hours. One useful way to find these different types
of outliers in our data is something we’ve
already discussed quite a bit, visualization. It’ll be easier to see any
big dips or giant blips in the data when we plot it on a line graph or a bar chart. No matter how you discover that your data contains
global, contextual, or collective outliers. It is essential that your EDA includes a strategy
for dealing with them. It will be up to you to decide
how outliers need to be represented or whether they need to be removed completely. The decision on what to
do must always be done in the context to the dataset and the business
plan for the data. As mentioned in other videos, always consider the
ethical implications of any decision you
make about outliers. When you’re working as
a data professional, it may be tempting to remove
outliers to improve results, predictions or forecasts. But you will need
to ask yourself, does that change the data story? For example, let’s
imagine that you are a data professional
at a retail business. You review two years of sales data and find
that there’s one month where sales of a
typically popular item are down dramatically. You might at first
assume that there’s a typo in the reported data, but as a diligent
data professional, you ask questions about it. You discover that
several top salespeople were on leave during
that month and your company advertised
a new product which slowed sales on
the existing item. It is likely that both of these issues led to
a drop in sales. Rather than dismissing the
drop in sale as an outlier, this information will be
helpful for the business team. Using this information,
managers can prevent a future drop in
sales numbers ahead of the product launches by hiring additional
sales team members or limiting requested
employee leave during those critical times. Later, you’ll learn how to find outliers and work
with them in Python.

Reading: Protect the people behind the data

Reading

We’ve all been there…

Whether at work or at school, there’s a moment of realization about an essay or a project you’ve been working on—you suddenly realize you’ve made a mistake, and it needs to be fixed.

“But think of all of the trouble that will cause,” you think. “It will make the project late, and everyone will find out I made a mistake.”

Is it really worth the time and effort to stop the process and fix it?

For data professionals, the answer needs to always be YES.

The big picture

Even seasoned data professionals can fail to see the big picture. As intuitive as our brains can be, we often fail to see how a small error, a tiny change, or a seemingly insignificant choice about data analysis can have significant implications on a process, a business, or even a whole population.

To help illustrate this, imagine a grid of alternating orange and blue circles. Now, imagine there is a person who can only see a small section of six circles. What if this person decides that it makes more sense for the circles to be arranged by groups, with orange circles on top and blue on bottom? You’ll find that even though the circles within the rectangle may be perceived as orderly, the reality is the change has actually disrupted the larger pattern. Now imagine the rectangle of six circles is a department and the bigger grid is an entire company.

An ethical mindset

Hopefully, this illustration helps you understand the importance of context and scope when dealing with data. One principle that can help data professionals keep data in context is ethics and developing an ethical mindset.

Here are three important concepts to help you foster an ethical mindset as you seek to tell stories with data:

Models have lives beyond code blocks and data strings.
Data science needs regulatory compliance.
Customers should choose how their data is used.

Models have lives beyond the code blocks and data strings

There can be a tendency to think of data as strictly numbers and math. The reality is that data consists of people-driven inputs. Data professionals should never lose sight of the fact that analysis in all forms—visualizations, models, decisions, and strategies—impacts people.

The end goal of all data analysis and data science work should be to improve the lives of individuals and groups. Next time you are designing a data dashboard or coding a complex algorithm, take a moment to consider how the data might impact human beings, and whether the decisions you make alter that impact for the better. Remember that your own biases may prevent you from being able to anticipate many impacts, so gathering input from a diverse group of team members helps you mitigate your personal biases.

Here is a hypothetical data career example to help you see how this concept may someday impact your work:

Imagine a data professional is analyzing traffic flow data in order to help a city planner for a major metropolitan area decide where to focus on expanding roads. At first glance, using data for road construction appears to be a purely financial decision between a city and a road construction company. The professional’s analysis first led them to select a more hilly, uneven terrain as the most cost-effective place for the expansion of roads. But after analyzing the data with concern for the citizens and families who will be driving on those roads in all seasons, the data professional’s analysis determined the hilly area to have a higher accident risk, and decided to recommend a flatter location for road expansion.

Data science needs regulatory compliance

If you are unfamiliar with the laws and regulations of the industry you work in, your job will not only be much more difficult, but also could land you in legal trouble. A data professional should always remain up to date and in compliance with all regulations in their particular field. As an example, when you begin a new job in the data career space, you can do the following:

Ask your manager for compliance guidance.
Take time to research the data governance policies.
Take time to research the regulatory body of your particular industry and its relevant policy documents.

Keep in mind that the data governance field is quite young and that regulations can and will change at a quick pace. Be sure to keep up with the changes!

It’s important to follow the rules … but not just for rules’ sake. Here are a few important benefits of remaining compliant with regulatory bodies and data governance policies:

Keeps client and company data safe from security threats
Bolsters trust with clients, peer groups, companies, and public
Lessens likelihood of lost or mismanaged data
Ensures business critical data remains usable, accessible, and available

Customers should choose how their data is used

With each passing day, technology gets more advanced and capable. As data becomes increasingly involved in our day-to-day lives, it stands to reason the data itself becomes more and more important and valuable. Because of this, data privacy and security is critical to a company’s clients and a government’s citizens.

As data professionals, it is our ethical responsibility to treat client and customer data with respect and dignity. Companies that store, analyze, and utilize data as part of their business processes should make sure those processes are transparent and compliant. Customers who divulge confidential information in order to procure a service should be able to trust that the data will remain confidential and not compromised due to breaches or cyber security threats.

You’ll find that many companies have begun to treat customer data with more internal care and security. More companies are providing extensive employee training regarding the handling and securing of customer data. Many are adding multi-factor authentication systems and auditing their third-party vendors, requiring high levels of digital security systems and platforms.

As an example, because data breaches like Home Depot in 2014, Uber in 2017, and Instagram in 2020 have been commonplace for decades, corporations are increasingly ramping up cybersecurity, employee and vendor training, and transparency to customers and the public regarding data gathering and data use methods. Specifically, Home Depot has added data security into its risk management plan, and built a data security web page that describes what data they gather and how they use it.

Data transparency is not only good for business, it is the right thing to do—the ethical thing to do. Data professionals need to understand that more than anyone.

Key Takeaways

Data ethics is an expanding field and remains an integral part of a data professional’s daily work. While manipulating, analyzing, or using data in any capacity, remember that data models have lives beyond code blocks and data strings. Data science needs regulatory compliance and customers should always get to choose how their data is used.

Resources for more information

Data governance and data ethics are a growing field of work and study. Here are some resources should you like to learn more:

Video: Identify and deal with outliers in Python

Notes

Tutorial

Quiz

Transcript

Identifying Outliers in Lightning Strike Data:

Goal: Find outliers in 33 years of US lightning strike data.
Data preparation:
- Grouped data by year (1987-2020).
- Created readable labels for large strike numbers.
Initial exploration:
- Calculated mean and median: 26.8 million and 28.3 million.
- Suspected left-skewed distribution based on mean < median.
Visualization:
- Boxplot revealed two outliers below 10 million strikes.
- Calculated IQR and defined upper/lower limits for outliers.
- Scatter plot highlighted outliers in years 1987 and 2019 (red points).
Investigation:
- 2019 data only included December strikes: outlier to exclude.
- 1987 data included all months: outlier to keep.
Re-calculation:
- Excluded outliers and recalculated mean/median (closer together).
Conclusion:
- Identified and analyzed outliers in lightning strike data.
- Demonstrated importance of outlier detection in data analysis.

Key Takeaways:

Outliers can significantly skew data and results.
Visualization tools like boxplots and scatter plots help identify outliers.
Understanding the context of outliers is crucial for handling them appropriately.
Excluding or keeping outliers depends on their origin and impact on the analysis.

This summary captures the main points of the video, covering data preparation, exploration, visualization, investigation, and conclusion. It emphasizes the importance of identifying and analyzing outliers for accurate data analysis.

⚡ Striking Down Outliers: Exploring Anomalies in Lightning Strike Data

Welcome data detectives! Today, we’ll embark on a thrilling investigation: finding outliers in US lightning strike data spanning 33 years. Prepare to uncover hidden patterns, understand data distributions, and wield powerful Python tools to expose these statistical anomalies.

The Case of the Mysterious Strikes:

Our dataset holds the total lightning strikes recorded in the US from 1987 to 2020. But can we trust every data point? Enter outliers: unusual data points that deviate significantly from the majority. Identifying them is crucial for accurate analysis and avoiding misleading conclusions.

Our Toolkit for Unmasking Outliers:

Python Libraries: We’ll use Pandas, NumPy, Seaborn, and Matplotlib for data manipulation, analysis, and visualization.
Statistical Measures: Mean, median, interquartile range (IQR), quantiles – these statistics will unveil data distribution trends and potential outliers.
Boxplots and Scatterplots: These visual powerhouses will showcase data distribution and highlight outliers for closer inspection.

The Investigation Begins:

Importing the Data and Libraries: Start by importing our trusty libraries and loading the lightning strike data into a Pandas DataFrame.
Data Wrangling: Let’s make the data easier to read and analyze. Use functions like .head() to preview the data and consider scaling large numbers for better visualization.
Mean and Median: Unveiling the Center: Calculate the mean and median of the total lightning strikes. Are they close or far apart? This hints at the data distribution (normal, skewed, etc.).
Boxplot: Unveiling the Outliers: Generate a boxplot to visualize the data distribution. Look for data points beyond the whiskers – those are potential outliers!
Interquartile Range and Quantiles: Calculate the IQR and identify the 25th and 75th percentiles (quartiles). This helps define the “normal” range of data.
Outlier Thresholds: Set upper and lower limits based on 1.5 times the IQR. Any data point beyond these limits is considered an outlier.
Identifying the Culprits: Use logical indexing to isolate data points below the lower limit – our identified outliers!
Scatterplot: Seeing Outliers in Action: Create a scatterplot with year as the X-axis and number of strikes as the Y-axis. Color outliers for easy identification.
Investigating Individual Outliers: Zoom in on specific outliers (e.g., 1987 and 2019). Check for missing data, data entry errors, or unusual events that might explain the anomaly.
Cleaning the Data: Decide how to handle outliers. Exclude them if they significantly skew the analysis, or adjust values if justified.
Recalculating Statistics: Recalculate the mean and median excluding outliers. Observe the difference in data distribution.

Conclusion:

By exploring statistical measures, visualizations, and investigating individual cases, you’ve successfully identified and analyzed outliers in your lightning strike data. Remember, understanding outliers is critical for drawing accurate conclusions and making informed decisions based on your data.

Bonus:

Explore different outlier detection algorithms like Z-score or Median Absolute Deviation (MAD).
Learn about outlier treatment methods like winsorization or data transformation.
Apply these techniques to analyze other datasets and become a master of data anomaly detection!

So, grab your Python tools, put on your data detective hat, and get ready to strike down those outliers!

Docstrings are useful within a line of Python code, but they cannot be exported to create library documentation.

False

Docstrings are lines of text following a method or function that explain to others what that method or function does. They can also be easily exported to create library documentation.

Earlier, you learned about
three types of outliers, global, contextual,
and collective. Now that we discussed outliers
on a conceptual level, I’ll show you how
to identify them and analyze their impact
in a Python notebook. Our goal will be to
identify outliers across a 33-year span of total lightning strike
counts in the United States. As usual let’s start by importing our libraries
and packages, pandas, numpy, seaborne,
and pyplot from matplotlib. We’ll continue using our NOAA
lightening strike dataset. This time we’ll group
the total sum of the lightning strikes in the United States by
year from 1987-2020. Let’s first reveal
the top 10 rows of data using the head function. The dataset has two
numeric columns, year and number of strikes. As you’ll notice,
we’re dealing with some fairly large numbers for the lightening
strike totals. It would be helpful
to make them a bit shorter and easier to
read in visualizations. To do that, let’s write a
readable numbers function. Below the function, there is a long explanatory text inside a triple set
of quotation marks. This is called a documentation
string or docstring. A documentation string
or docstring is a line of text
following a method or function that is used to
explain to others using your code what this
method or function does. A docstring represents good documentation
practices in Python. It makes the code easier
to understand and can be easily exported to create
library documentation. A docstring was already provided for this readable
numbers function. Next we’ll form an
if else statement. We want to code any number above six digits to be formatted
to one decimal place, followed by an M. Under
the else statement we are formatting numbers with
more than three digits to the same one decimal place, followed by the letter K.
Under these statements, we define the new
DataFrame we want titled number of
strikes readable. We then use the apply function to set our readable numbers statement to only the number of strikes column
in the DataFrame. Using the head function, we’ll see the output
of our newest code. The new column has readable
values of 15.6 million, 209,000 and 44.6 million
in its first three rows, where before they were just long strings
of numeric digits. This indicates that we coded the read numbers
function correctly. Now that we have more
readable values, let’s plot our data. As you’ll recall, our goal is
to find any outliers among the total number of
lightning strikes in 33 years of NOAA data. One way to find any
outliers would be to investigate the mean and
median of this data. The mean is the
average number of the given values and the median is the middle point
of the given values. When we code for
these two values, we use the numpy functions
of mean and median. Were using the readable
numbers for simplicity. The resulting
output is the mean, 26.8 million and the
median 28.3 million. With the mean being two million strikes less than the median, we suspect the data distribution is likely skewed to the left. The left side of the distribution will be a
good place to investigate. One effective way to visualize
outliers is a boxplot. As we discussed in
a previous video, a boxplot divides
the distribution of data points into four main
quadrants or quartiles. The boxplot is helpful
in visualizing and confirming the outliers
we’ve already found. Let’s plot the boxplot first and then we’ll talk
about it in more detail. To design a boxplot
using Python, we’ll use the seaborne
boxplot function. We’ll use the number
of strikes column from our DataFrame as the
basis for the plot. Next, we’ll use
set x tick labels based on the readable
numbers statement we created already. X tick labels simply
allow us to give names to the x-axis marks so that they are
read more easily. Lastly, we input the x and
y labels, and the title. You’ll recall that the purpose
of a boxplot is to show the distribution of values separated into
quadrants or quartiles. What we’re most focused
on with our plot are that the two points beyond the far left of the line
are below 10 million. This data visualization
is showing us very clearly there are outliers included in our group
of data points. The blue rectangles
in your boxplot are called the interquartile range. This range represents
the difference between the third quartile and the first quartile of
a set of data values. A standard rule statistics which you will learn more
about another course is that any data point
that falls beyond 1.5 times the blue boxes
are considered outliers. This next piece
of code will help us define what value is at 1.5 times below our
interquartile range. We’ll use the quantile function
to define what we’ll call percentile 25 and percentile 75. We enter our data column, number of strikes
for our DataFrame, and then 0.25, and 0.75 in the argument fields. The values that occur between the 75th and 25th percentile
are the interquartile range. We create a statement for
interquartile range, IQR, defining percentile 75 minus percentile 25 is equal to IQR. Now, let’s define
two statements; upper limit and lower limit. Upper limit is equal to
percentile 75 plus 1.5 times IQR. Lower limit is percentile
25 minus 1.5 times the IQR. Lastly, let’s have Python provide the exact value
of the lower limit. We’ll input print,
followed by the text. Lower limit is colon plus
sign readable numbers, and lower limit in
the argument fields. After running the cell, we learn that the lower
limit is 8.6 million. Don’t worry if
you’re still trying to understand
interquartile range. This concept will be explored further in depth
later in the program. Next, let’s plot
the values that lie below our lower limit
of 8.6 million. Place the number of strikes column inside the
brackets next to df. Then use the less than symbol followed by the lower
limit we just derived. Now here are the outliers. One great way of seeing these
outliers in relation to the rest of the data points
is a data visualization. For this plot, let’s
do a scatter plot. You may recall that a scatter plot represents
relationships between different variables with
individual data points without a connecting line. To start the scatter plot, let’s first add labels to
each point on the plot. We’ll define them using the add labels function
for x and y points. The plt.text function allows us to define how we want the text of each data
point to appear. In this case, we tell the
texts to start 0.05 pixels to the left of the data
point and just start 0.5 pixels above the datapoint. We also defined the
number of values using our readable number statement from the beginning of the video. Next, let’s add color
to the data points. Let’s make the outlier points red and the other points blue. To do that, we’ll define
colors by the lower limit. For those that fall below our previously
defined lower limit, we’ll code that they
should be r for red, and everything else will
code as b for blue. Our next line of code
should feel familiar. We start by configuring
our visualization size. In this case, we code
16 and eight inches, which are the default
units for this function. Next is our code for
scatter plots, ax.scatter. The ax refers to axis, which tells the notebook to plot the data on an x
versus y data graphic. Inside the argument field, we first input the x-axis
and then the y-axis, which is year and number
of strikes, respectively. For the rest of the parameters, we fill in the x and y
label and the title. We finish the parameters
of our scatter plot by defining the x axis
tick labels and setting the rotation
of the x-axis ticks to a rotation
of 45 degrees. The resulting plot shows
us the years 1987 and 2019 are the two values
in red or two outliers. Now that we’ve narrowed our
scope of the outliers to just two years,1987 and 2019, we can do a little
more investigating. Let’s start with the 2019
lightning strike data. If we import just the 2019 data, the first thing
we’ll want to do, as we learned in a
previous video is convert our date
column to date-time. Next, we’ll create
some new columns in our dataframe month
and month text. You’ll notice for the date, we use the str.slice function to cut the month names down to
just the first three letters. Finally, we create
a dataframe that groups the total number of lightning strikes
for each month. The result explains why
the year is an outlier. For the 2019 dataset only lightning strikes from the month of December
had been inputted. If you were a data
professional tasked with calculating lightening
strike risk, you would first research the NOAA documentation available to see if there’s
a reason given. If you don’t find anything
based on your research, you would then ask the
NOAA why lightning strikes haven’t been recorded for
the other 11 months of 2019. When we do the same
code from 1987, the resulting output is
different from 2019. We find there are
lightning strikes recorded for each month in 1987. This difference that 2019 does not have data
for all months, and that 1987 does have
data for all months, helps us to know how to
handle these two outliers. For 2019, it would make sense to exclude it from our analysis. As for 1987, we
recognize that it is indeed an outlier
but it should not be excluded from our analysis
because it does have lightening strike
totals included for each month of the year. Now, before we end,
let’s return to the mean and median values
that we calculated earlier. First, let’s create
a new data-frame, including only data points that exclude the outliers
we identified. Our goal here is to
show the mean and median of the dataset
excluding the outliers. To do this, we’ll
use a data-frame without outliers we just created and our readable numbers
statement to only include data points that are greater than our lower limit. When we exclude those outliers, the mean and median are
much closer together, suggesting a fairly evenly
distributed dataset. Now, let’s revisit
the goal we set earlier: to identify outliers across the 33-year span of total lightning strike
counts in the United States. We discussed a lot of new concept and we
achieved this goal.

Reading: Reference guide: How to handle outliers

Reading

Reference-guide_-How-to-handle-outliers-Download

Practice Quiz: Test your knowledge: The ins and outs of data outliers

What type of outlier is a normal data point under certain conditions, but becomes an anomaly under most other conditions?

Contextual outlier

A contextual outlier is a normal data point under certain conditions, but becomes an anomaly under most other conditions.

The answer is Contextual outlier.

Contextual outliers are data points that appear unusual when viewed in the context of the entire dataset, but are normal within a specific subset or context. These outliers are often the result of specific conditions or circumstances that only apply to a certain group of data points.

Here’s a breakdown of the other options:

Constant outliers are data points that are consistently anomalous across all contexts, indicating potential errors or extreme events.
Collective outliers are groups of data points that deviate from the rest of the dataset as a whole, suggesting a unique pattern or behavior within that group.
Global outliers are data points that are significantly different from the majority of the data, regardless of any specific context.

Understanding the different types of outliers is crucial for effective data analysis and decision-making. Contextual outliers, in particular, highlight the importance of considering context when interpreting data and identifying anomalies.

What is the term for a line of text that follows a method or function, which is used to explain the purpose of that method or function to others using the same code?

Docstring

A docstring is a line of text that follows a method or function, which is used to explain the purpose of that method or function to others using the same code.

A data professional is using a box plot to identify suspected high outliers in a dataset, according to the interquartile rule. To do that, they search for data points greater than the third quartile plus what standard of the interquartile range?

1.5 times

The answer is 1.5 times the interquartile range (IQR).

Interquartile Rule for Identifying Outliers:

Calculate the IQR: Subtract the first quartile (Q1) from the third quartile (Q3).
Find the lower bound: Subtract 1.5 times the IQR from Q1.
Find the upper bound: Add 1.5 times the IQR to Q3.
Identify outliers: Data points below the lower bound or above the upper bound are considered potential outliers.

Specifically for high outliers:

Data professionals look for points greater than Q3 + 1.5 * IQR.

Key Points:

The IQR represents the middle 50% of the data’s spread.
The 1.5 multiplier is a common convention for outlier detection, but adjustments might be made based on domain knowledge and analysis needs.
Outliers can be legitimate data points or errors. Further investigation is often required to determine their cause and significance.

Change categorical data to numerical data

Video: Sort numbers versus names

Notes

Tutorial

Quiz

Transcript

Summary of Categorical Data and Encoding Techniques:

What is categorical data?

Data divided into a limited number of qualitative groups (e.g., occupation, ethnicity).
Often represented in words and has a limited number of possible values.

Challenges with categorical data:

Many data models and algorithms don’t work well with it compared to numerical data.
Needs conversion to numerical data for tasks like prediction, classification, and forecasting.

Two common encoding methods:

1. Dummy variables:

Create new binary variables (0/1) indicating presence/absence of a category.
Easier to interpret in statistical models and machine learning algorithms.
Example: “mild” category in a dataset becomes a “1” while other values are “0”.

2. Label encoding:

Assign unique numbers to each category instead of qualitative values.
Simplifies data cleaning, joining, and grouping.
Reduces storage space and improves algorithm/model performance.
Example: Mushroom types in a dataset get assigned numbers (“black truffle” = 0, “button” = 1, etc.).

Choosing the right encoding method:

Depends on the business need, model/algorithm used, and data characteristics.

Remember:

Data encoding is crucial for preparing categorical data for use in models and algorithms.
Choosing the correct method ensures efficient data analysis and accurate results.

Key takeaway:

Like having the right charger for your phone, using the proper data encoding technique ensures your data is ready to use for powerful analysis and insights.

Here’s a tutorial on Categorical Data and Encoding Techniques:

Understanding Categorical Data:

Definition: Data that represents qualitative descriptions rather than numerical values.
Examples: Gender (male/female), occupation (doctor, teacher, engineer), education level (high school, bachelor’s, master’s)
Importance: Often used to capture descriptive characteristics in various datasets.

Challenges with Categorical Data:

Numerical Preference: Many data models and algorithms are designed to work primarily with numerical data.
Conversion Necessity: Categorical data often needs to be converted into numerical representations for effective analysis and modeling.

Encoding Techniques:

Dummy Variables (One-Hot Encoding):
- Create new binary variables (0 or 1) for each unique category.
- Each observation has a “1” in the column representing its category and “0” in all other category columns.
- Example: Gender (male/female) becomes two columns: “Male” (1 for males, 0 for females) and “Female” (1 for females, 0 for males).
Label Encoding:
- Assign unique numerical values to each category in a column.
- Simpler approach, but the assigned numbers don’t inherently have any meaning beyond representing categories.
- Example: Education level (high school = 0, bachelor’s = 1, master’s = 2).

Choosing the Right Technique:

Consider model type: Some models work better with dummy variables (e.g., linear regression), while others can handle label encoding (e.g., decision trees).
Avoid numerical assumptions: Label encoding can introduce unintended ordinal relationships if not used carefully.
Consider data characteristics: The number of categories and their relationships can influence the choice.

Additional Considerations:

Ordinal Encoding: For ordinal categorical data (categories with a natural order), numerical values can be assigned based on their order.
Target Encoding: Encoding categories based on their relationship with the target variable in predictive modeling tasks.

Best Practices:

Experiment with different techniques: Evaluate model performance with various encoding methods to find the best fit.
Consider domain knowledge: Incorporate understanding of the data and business problem to guide encoding choices.
Document decisions: Clearly record the encoding methods used for reproducibility and understanding.

Remember:

Data encoding is a crucial step in preparing categorical data for analysis and modeling.
Choose encoding techniques thoughtfully to ensure accurate and meaningful results.

Categorical data can be grouped on its qualities, thus enabling data professionals to store and identify it based on its category.

True

Categorical data can be grouped depending on its qualities, thus enabling data professionals to store and identify it based on its category.

Have you ever tried to charge your
phone with the wrong type of plug? It is a frustrating problem but
definitely not an insurmountable one. You will probably experience a similar
frustration when you work with categorical data. Categorical data is data that is
divided into a limited number of qualitative groups. For example, demographic information
tends to be categorical, like occupation, ethnicity and
educational attainment. Another way to think about categorical
data is data that uses words or qualities rather than numbers. As a data professional, you will likely work with datasets
that contain categorical data. Categorical data entries can typically
be identified quickly because they are often represented in words and
have a limited number of possible values. Many data models and
algorithms don’t work as well with categorical data as they
do with numerical data. Assigning numerical representative
values to categorical data is often the quickest and most effective way
to learn about the distribution of categorical values in a data set. There are some algorithms that work well
with categorical variables in word form, like, decision trees, which you’ll
learn about in another course. However, with many data sets, the categorical data will need to
be turned into numerical data. This conversion is often essential for predicting classifying forecasting and
more. There are several ways to change
categorical data to numerical. In this video we will focus
on two common methods, creating dummy variables and
label encoding. One way to work with categorical
data is to create dummy variables to represent those categories? Dummy variables are variables
with values of 0 or 1, which indicate the presence or
absence of something. It helps to open a data set
with dummy variables already created to understand
what their function is. You’ll find in this data set that for any value that has been determined
to be mild, a 1 has been input. All other values in the mild
column are labeled with 0s. The same goes for
other categories and values. You can think of it as the one represent
a yes, and the zeroes representing no. Dummy variables are especially
useful in statistical models and machine learning algorithms. Another way to work with categorical
data is called label encoding. Label encoding is a data transformation
technique where each category is assigned a unique number
instead of a qualitative value. For instance, let’s imagine you had a data
set about mushrooms, in which you wanted to understand the distribution
among different types of mushrooms. Imagine you’ve been given a data set about
mushrooms with a column titled type, and the options of black truffle,
cremini, king trumpet, button, hedgehog, morel,
portobello, toadstool or shiitake. You could use label encoding to transform
each mushroom type into a number. Black truffle would be changed to a 0, button would become a 1,
cremini a 2, hedgehog, a 3, king trumpet a 4,
morel 5, Portobello 6, shiitake 7, and
lastly toadstool would be 8. So why do data professionals
use label encoding? Data is much simpler to clean, join and group when it is all numbers,
it also takes up less storage space. An algorithm or model typically runs
smoother when we take the time to transform our categorical
data into numerical data. For example, let’s say you try to run
a prediction model using the mushroom dataset which will try to anticipate
the percent chance a new mushroom introduced to the data is a Portobello. If we try to create a model without
first performing the label encoding, the prediction model will not function. Of course there will be models and algorithms in which you may not
want to perform label encoding. Let the business need be
your guide on whether or not you’ll need to perform label encoding. The type of model or algorithm you
choose will determine whether or not you encode labels. You’ll learn more about algorithms and
models in an upcoming course. Much like making sure you have the right
phone charger to plug into your phone, as a data professional, you need to make sure your data is
ready to use with models or algorithms.

Video: Label encoding in Python

Notes

Tutorial

Quiz

Transcript

Topic: Label encoding in Python for categorical data

Dataset: NOAA lightning strike counts (2016-2018)

Steps:

Import libraries: datetime, pandas, seaborn, plotly
Prepare data:
- Convert date to datetime format.
- Extract month name (first 3 letters).
- Create month group and year column.
- Group data by year and month, sum number of strikes.
Label encode:
- Create strike level column (Mild, Scattered, Heavy, Severe) using qcut.
- Assign numeric codes to each strike level using .cat.codes.
Create dummy variables:
- Use get_dummies to convert strike level categories to binary columns (ones and zeros).
Visualize data:
- Reshape data using pivot (year as index, month as columns, strike level code as values).
- Create a heatmap using seaborn.heatmap to show distribution of strike levels across months and years.

Outcome:

Categorical data transformed into numerical data and dummy variables for easier visualization and analysis.
Heatmap reveals months with the highest concentrations of severe lightning strikes.

Key takeaways:

Label encoding allows working with categorical data for visualizations and models.
Pandas and Seaborn libraries facilitate data manipulation and visualization.
Label encoding is a necessary skill for data professionals.

Note: This summary focuses on the main points of the video. Some details and explanations might be omitted.

Here’s a tutorial on label encoding in Python for categorical data:

What is Label Encoding?

It’s a technique for transforming categorical data (text labels) into numerical data (integers).
This is often necessary for machine learning algorithms that require numerical inputs.

Why Use Label Encoding?

Machine Learning Compatibility: Many algorithms can’t directly handle text-based categorical data.
Visualization: Numerical representations can be easier to visualize and plot.
Feature Engineering: Label encoding can be a step in preparing categorical features for modeling.

Steps for Label Encoding in Python:

Import Libraries:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

Load Data:

df = pd.read_csv("your_data.csv")

Identify Categorical Columns:

categorical_cols = ["color", "size", "category"]  # Replace with your column names

Create LabelEncoder Instances:

encoders = {}
for col in categorical_cols:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
    encoders[col] = encoder  # Store encoders for later use

Access Numerical Representations:

print(df["color"])  # Will now show integers instead of text labels

Reverse Transformation (Optional):

original_labels = encoders["color"].inverse_transform([0, 1, 2])  # Example

Key Points:

Label encoding assigns arbitrary integer values to categories.
It doesn’t imply any inherent order or relationship between categories.
Consider ordinal encoding or one-hot encoding for cases with inherent order or multiple categories per feature.
Handle missing values appropriately before encoding.

Example:

Python

data = ["red", "green", "blue", "red", "green"]
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)
print(encoded_data)  # Output: [0 1 2 0 1]

Fill in the blank: A heat map uses _____ to depict the magnitude of an instance or set of values.

colors

A heat map uses colors to depict the magnitude of an instance or set of values. Heat maps are a type of data visualization that is useful for showing the concentration of values between different data points.

In a previous video, you learned about
the importance of label encoding, also known as transforming
categorical data into numerical data. Now it’s time to open our python
notebooks and learn how to do it. Let’s continue with the same NOAA
dataset we’ve been using for lightning strike counts. For this notebook, we’ll look at
data from 2016, 2017, and 2018. Let’s start by importing
python libraries and packages. For this notebook,
let’s import date time pandas, seaborn and pi plot from matt plot lib. As you learned previously,
start by converting the date column into date time,
which makes it easier to manipulate. Next, create a column called month. Like we did earlier, we will use str.slice to cut the month names down to
only the first three letters. When working with this data, it is helpful to make sure the month
names stay in chronological order. Our next line of code then is months, which simply defines
the group of months in order. You’ll find this helpful on the next line
of code where we use the pandas function categorical to group the number of strikes
column by the month column because we have three years worth of data. Let’s also create a column
listing the year. To do this, we name the column year,
using the date time function strf time. This function will return a string
representing only the year in the original date column. Inside the argument field,
we input percent sign y indicating we only want the year from
the date time string. Finally, let’s create a data
frame called df by month, which groups the number of strikes
first by year and then month. The last part of the code here is
making sure that the number of strikes is added together for
each month using the sum method. We’ll tack on the reset index function
to tell the notebook to recount the rows from zero without
using its initial order. When we look at the top of the data frame,
using the head function, we get the first five rows of
data with columns of year, month, and number of strikes. Let’s create a column for
categorical variables called strike level. In this column, we will group or
bucket the total number of strikes into categories Mild,
scattered, Heavy, and Severe. This is how we perform
label encoding in python. To do this will create a new column in R df by month data
frame called strike level. This new column will be coded
using the pandas function qcut the queue in this function
refers to quantiles. Meaning that when using
these pandas function, you can cut the data into
four equal quantized. First parameters, we first input
the desired column we want to cut into quantiles, which will be
the column called number of strikes. Then we input the number of quantiles we
want, which is 41 for each classification. We type in those classifications
under the label parameter, Mild, scattered Heavy, and Severe. Let’s use the head function to
find out how our code modified the data frame df by month. It may seem counterintuitive to
create categorical data in order to then transform it to numerical data. But this process will allow us to
segment the data into useful chunks, which can then be plotted on
a graph in interesting ways. Not to mention, string categories
are easier to understand in a data visualization than numbers. One other column which will help us
later is to assign numerical values to the strike levels. We just defined as mild,
scattered, heavy, and severe. We’ll call this new
column strike level code. We’ll define it by taking the column
strike level and adding .cat.codes. Cat codes is a pen dysfunction
that takes categories and assigns them a numeric code. In this case, when we run the cell we
find that mild has been assigned a 0, scattered assigned a 1,
heavy a 2, and severe a 3. And that’s how we perform label encoding. Having both the categorical terms for
different levels of lightning strikes and the numeric codes to go with them,
we can create data visualizations or models that are straightforward
to interpret. One other helpful way to work with
categorical data is to create dummy variables to
represent those categories. As you’ll recall from a previous video,
dummy variables are variables with values of 0 or 1, which indicate
the presence or absence of something. The pandas function will use
to achieve this in our python notebooks is called get dummies. To clarify, new columns of ones and
zeros are also known as dummies. The get_dummies function converts
these categorical variables, mild, scattered, heavy, and
severe into numerical or dummy variables from the column
called strike level. If we run the cell, you’ll find this
function creates four new columns and puts a one in any row where
the number of lightning strikes falls into the labeled category,
and a zero for anything else. Now that we have our new columns of
categorical data, numeric data, and dummy variables, let’s discover what you can
potentially learn from these groupings. To do that, let’s plot our data. We’ll first create a data
frame called df_by_month_plot. Let’s reshape the data by using
the pandas function, pivot. Pivot allows us to reshape our
data frame into any given index or set of values we want based on
the parameters inside the pivot function. Those parameters are index columns and
values in that order. To help us with the visualization,
let’s use year in the index field, month in the columns field, and
strike level code in the values field. If we use our head function,
we find all three years in the rows, the months in the columns and the strike level categories in each month,
just like we coded. Next, let’s visualize our data
using what is called a heatmap. So we have a better idea of where most
of the lightning strike values are. A heatmap is a type of data visualization
that depicts the magnitude of an instance or
set of values based on two colors. It is a very valuable chart for showing the concentration of values
between two different points. Let’s create a heatmap using python. Using the heatmap plot from seaborn, we can place df_by_month_plot
in the parameter field for the data along with sea map equals blues. This sea map definition will give
us a preset color gradient from dark blue to white. Next, we fill in some of the parameters to
help with understanding the visualization. In this case, we’ll use the standard
collection zero from seaborn for our color bar. This will just give us
a preset gradient color bar, which saves us a lot of time from having
to code our own colors on a color bar. Next, we’ll set the color bar
ticks to zero through three and label those ticks according to
our four strike levels, severe, heavy, scattered, and mild. Lastly, when we code the show function,
we are given a beautiful heatmap. Using the heatmap, we can find the most severe lightning
strike months across all three years. Thanks to Pandas and Seaborn
transforming data into categories and back into numerical data is possible with
just a few lines of code as we discussed. It’s easier to plot a visualization
by using strike level code as the base values. The categories we created of mild,
scattered, heavy, and severe became
the labels on our heatmap plot. Label encoding can be as simple as that,
and it is an essential skill for data professionals.

Reading: Other approaches to data transformation

Reading

As you know, data comes to us in many different forms. For categorical or qualitative data types, data professionals often need to transform (or encode) this type of data into numeric digits to complete their analysis, design their data visualization, or build their machine learning algorithm. In this reading, you will learn about the two main types of categorical data encoding and when to use each type.

Label encoding

You’ve already learned about label encoding, which is one type of data transformation technique where each data value is assigned a distinct number instead of a qualitative value.

If you remember from the video, the example provided was label encoding mushroom types:

Mushroom Type	Code
Black truffle	0
Button	1
Cremini	2
Hedgehog	3
King Trumpet	4
Morel	5
Portobello	6
Shiitake	7
Toadstool	8

As you can tell, for this hypothetical data set about mushrooms, each mushroom type was assigned its own number, starting with zero.

Some potential problems with label encoding

Imagine you’re analyzing a dataset with categories of music genres. You label encode “Blues,” “Electronic Dance Music (EDM),” “Hip Hop,” “Jazz,” “K-Pop,” “Metal,” “ and “Rock,” with the following numeric values, “1, 2, 3, 4, 5, 6, and 7.”

With this label encoding, the resulting machine learning model could derive not only a ranking, but also a closer connection between Blues (1) and EDM (2) because of how close they are numerically than, say, Blues(1) and Jazz(4). In addition to these presumed relationships (which you may or may not want in your analysis) you should also notice that each code is equidistant from the other in the numeric sequence, as in 1 to 2 is the same distance as 5 to 6, etc. The question is, does that equidistant relationship accurately represent the relationships between the music genres in your dataset? To ask another question, after encoding, will the visualization or model you build treat the encoded labels as a ranking?

The same could be said for the mushroom example above. After label encoding mushroom types, are you satisfied with the fact that the mushrooms are now in a presumed ranked order with button mushrooms ranked first and toadstool ranked eighth?

In summary, label encoding may introduce unintended relationships between the categorical data in your dataset. When you are making decisions about label encoding, consider the algorithm you’ll apply to the data and how it may or may not impact label encoded categorical data.

Fortunately, there is another method for categorical encoding that may help with these potential problems.

One-hot encoding

As you learned in a previous video, you can create dummy variables in Python. If you remember, a dummy variable is a variable with values of 0 or 1, which indicate the presence or absence of something. The idea is to create a new column for each category type, then for each value indicate a 0 or a 1 — 0 meaning, no, and 1 meaning, yes.

This creation of dummies is called one-hot encoding. As a reminder, a table with one-hot encoding ends up like this:

N/A	Mild	Scattered	Heavy	Severe
0	1	0	0	0
1	1	0	0	0
2	0	1	0	0
3	0	0	1	0
4	0	0	0	1
5	0	0	0	1
6	0	0	0	1
7	0	0	0	1
8	0	0	1	0
9	0	1	0	0
10	1	0	0	0
11	1	0	0	0
12	0	1	0	0

You’ll find the values from the lightning strike dataset covered in the video are labeled as “mild” and have a “1.” “Mild” refers to the lowest quartile of lightning counts in the dataset. For any other value in the “mild” column that is NOT mild, there is a zero in that cell. With this method, we solve the problem of the unintended and problematic relationships that label encoding presented.

But one-hot encoding does present its own set of problems, particularly when it comes to logistic and linear regression. You will learn more about that in a future course.

Label encoding or one-hot encoding: How to decide?

There is no simple answer to whether you should use label encoding or one-hot encoding. The decision needs to be made on a case-by-case, or dataset-by-dataset basis. But there are some guidelines to help you.

Use label encoding when:

There are a large number of different categorical variables — because label encoding uses far less data than one-hot encoding
The categorical values have a particular order to them (for example, age groups can be grouped as youngest to oldest or oldest to youngest)
You plan to use a decision tree or random forest machine learning model

Use one-hot encoding when:

There is a relatively small amount of categorical variables — because one-hot encoding uses much more data than label encoding.
The categorical variables have no particular order
You use a machine learning model in combination with dimensionality reduction (like Principal Component Analysis (PCA))

Key takeaways

Label encoding and one-hot encoding are techniques for transforming categorical data to numeric data. Label encoding is best for large numbers of different categorical variables and for categories that have an inherent order to them. One-hot encoding is best for smaller amounts of categorical variables and for categories that have no order.

Reading: Reference guide: Data cleaning in Python

Reading

Reference-guide_-Data-cleaning-in-Python Download

Practice Quiz: Test your knowledge: Changing categorical data to numerical data

Fill in the blank: Label encoding assigns each category a unique _____ instead of a qualitative value.

number

Label encoding assigns each category a unique number instead of a qualitative value. This process enables data professionals to more effectively work with categorical data.

The answer is number.

Explanation:

Label encoding is a technique used to transform categorical data into numerical form, enabling its use in machine learning algorithms that require numerical inputs.
It assigns each distinct category a unique integer value, typically starting from 0.
This conversion facilitates mathematical operations and pattern recognition within the data.

Incorrect options:

String: Label encoding involves numerical representation, not strings.
Qualifier: Qualifiers describe qualities, while label encoding focuses on numerical assignment.
Character: Characters are individual symbols, not suitable for representing entire categories.

When working with dummy variables, data professionals may assign the variables an infinite number of values.

False

Dummy variables have a value of 0 or 1. They are used to indicate the absence (0) or presence (1) of something.

The answer is False.

Explanation:

Dummy variables (also known as indicator variables or one-hot encoding) are used to represent categorical data with multiple categories.
They create a new binary variable (0 or 1) for each category, indicating the presence or absence of that category for a given observation.
The number of dummy variables created is always finite, equal to the number of unique categories minus 1 (to avoid multicollinearity).

Key points:

Each dummy variable can only take on two values: 0 or 1.
The number of dummy variables is limited by the number of categories in the original categorical feature.
Infinite values would violate the concept of dummy variables and lead to issues in statistical analysis and modeling.

Which pandas function does a data professional use to convert categorical variables into dummy variables?

get_dummies()

The get_dummies() function is used to convert categorical variables into dummy variables.

The answer is get_dummies().

Explanation:

get_dummies() is the pandas function specifically designed to create dummy variables from categorical data.
It takes a DataFrame or a Series as input and returns a new DataFrame with dummy-coded columns for each category.

Incorrect options:

convert_dummies(): Not a built-in pandas function.
get_categories(): Retrieves the categories of a categorical variable, but doesn’t create dummy variables.
convert_categories(): Not a standard pandas function for this purpose.

Key points:

get_dummies() is essential for handling categorical data in machine learning and statistical analysis.
It enables algorithms that require numerical inputs to work with categorical variables.
It helps capture relationships between categorical features and target variables.

Input validation

Video: The value of input validation

Notes

Tutorial

Transcript

Summary of the Video on Input Validation and Joining in EDA:

Key Points:

Input Validation:
- Crucial for clean, error-free data, leading to accurate analysis and model performance.
- Performed throughout the EDA process, not just once.
- Checks for completeness, accuracy, format consistency, and data type alignment.
- Like a chef checking and rechecking vegetables before cooking.
Joining:
- Augments data by adding values from other datasets.
- Effective only with validated data (similar formats, data types).
- Similar to a chef choosing compatible vegetables for a new recipe.
- Requires analytical thinking and adapting approaches based on specific datasets.
Benefits:
- Prevents system crashes, coding issues, and wrong predictions.
- Ensures ethical data handling and accurate business decisions.
- Aligns with the PACE workflow (Plan, Analyze, Construct, Execute).
Tips:
- Ask questions about format, range, and data type consistency.
- Collaborate with peers and managers for peer review and bias prevention.
- Invest in thorough validation for better analysis and fewer future problems.

Overall Message:

Input validation and joining are essential practices in EDA, requiring careful attention to ensure clean, accurate data. Just like a chef meticulously selects and preps ingredients, thorough data validation sets the stage for successful analysis and avoids future headaches.

Input Validation and Joining in EDA: A Practical Tutorial

Data is the lifeblood of analysis, but dirty data can lead to messy conclusions. Input validation and joining are two crucial practices in Exploratory Data Analysis (EDA) that ensure your data is reliable and ready for the big questions. This tutorial will equip you with the tools and techniques to validate your data and combine it effectively for deeper insights.

Part 1: Input Validation – Ensuring Data Integrity

Know your Data: Before diving in, understand the data source, format, and expected values. Ask questions like:
- What types of data are present (numerical, textual, dates)?
- Are there any expected ranges or limitations for these values?
- How is the data structured (columns, rows, format)?
Inspect with Care: Use visual and statistical tools to identify anomalies:
- Descriptive statistics: Calculate measures like mean, median, standard deviation to spot outliers.
- Distribution plots: Visualize the spread of data using histograms and boxplots.
- Data profiling: Look for missing values, duplicate entries, and inconsistencies in format.
Cleanse and Correct: Address the issues you identified:
- Impute missing values: Use appropriate methods like mean or median imputation.
- Standardize formats: Ensure consistent date formats, units of measurement, etc.
- Handle outliers: Decide whether to remove, adjust, or investigate further.
Repeat and Refine: Validation is an iterative process. Re-run checks after cleaning and throughout your analysis to maintain data integrity.

Part 2: Joining – Combining Data for Richer Insights

Identify the Need: Determine what questions you want to answer and which datasets hold relevant information.
Find the Common Thread: Identify a unique key (e.g., customer ID, product code) present in both datasets to link them.
Choose the Join Method: Depending on your needs, use different join types:
- Inner Join: Keeps matching rows from both datasets.
- Left Join: Keeps all rows from the left dataset and matching rows from the right.
- Right Join: Keeps all rows from the right dataset and matching rows from the left.
Verify the Outcome: Check for unexpected duplicates, missing information, and alignment of joined data points.

Bonus Tips:

Automate with scripts: Develop Python or other language scripts to streamline repetitive validation and joining tasks.
Document your process: Clearly explain your data cleaning and joining decisions for future reference and reproducibility.
Collaborate with others: Peer review your validation and join methods to benefit from different perspectives.

Practice Makes Perfect:

This tutorial provides a framework, but the best way to master these skills is to practice. Load datasets from real-world sources, apply the techniques we discussed, and see how clean and joined data empowers your analysis. Remember, validated and joined data is the foundation for accurate insights and impactful conclusions!

Additional Resources:

Pandas DataFrame documentation: https://pandas.pydata.org/docs/
Python libraries for data validation: https://pypi.org/project/validation/
Tutorials on data joining in Python: https://www.w3schools.com/python/pandas/ref_df_merge.asp

When I think about input validation and
the EDA practice of joining, I like to think about vegetables. Allow me to explain, when you’re at the market picking out
leafy greens and root vegetables, you check that they’re fresh first,
right, and not just at the store. You also check on them at home
when they’re in the refrigerator. Plus you probably ensure that
they’re still edible one more time before you cook and eat them. Lastly, if for some reason
the vegetables have gone bad or you didn’t buy enough for
a recipe you would need to add more. Performing EDA on a data set is similar to
checking for the freshness of vegetables. You are searching, exploring and checking that the data is
as error free as possible. You should check and recheck your data
sets to make sure that they are correct. As I mentioned before as
a data professional you should know your dataset thoroughly. One way to make sure you know
a data set is validating it, which is as you recall,
one of the six practices of EDA. There are many different ways
to validate a set of data, but here we’ll discuss Input validation. Input validation is the practice
of thoroughly analyzing and double checking to make sure data is
complete, error free, and high quality. Input validation is intended
to be an iterative practice, meaning you should perform it again and
again in between and during the other five practices of EDA,
which are discovering, structuring, cleaning,
joining and presenting. Most often as a data professional you
will perform input validation when starting a new analysis project or
getting familiar with a new data source. We will discuss more about the how of
performing input validation later in the program. Now we will focus on the why and the what? Why should we take
the time to validate data? What exactly are we looking for? When we validate data, we help make
more accurate business decisions and we improve complex model performance. Think about it like this if a gourmet chef
doesn’t double check the vegetables for freshness before cooking a dish
the food could taste horrible or worse, make people sick. It is much the same with data. The more careful we are in checking and
rechecking our data after each manipulation of the data during EDA, the
less likely we are to have problems later. Clean and validated data can help
prevent future system crashes, coding issues or wrong predictions. So, while we are performing our validation
work, what is it that we’re searching for? No two data sets are alike. There will be different things
to check for based on the type. Here’s some questions to
consider when validating data. Are all the data entries in
the same format, for example, are people’s ages expressed as
solitary numbers like 23 and 47 or within a range like 18 to 35 and
35 to 50, are all entries in the same range? For instance, in finance are some of the
values expressed in thousands of euros and others in millions of euros. Finally are the applicable data entries
expressed in the same data type. For example, are all the dated
entries expressed in the same format of month, day or year? While asking these questions or
while performing media in general, you may find that the data you have
been given is not sufficient to answer the business questions that
you have been tasked with. For example, let’s go back to the
comparison we made earlier in the video to a gourmet chef and his vegetables. Imagine that the gourmet chef is
introducing a new recipe to their restaurant. They’re not sure how many vegetables
will be needed for the new dish. In the end the chef realizes halfway
through that day that they did not account for additional vegetables to
be used in vegan and vegetarian dishes. It is at that point the chef
would buy more vegetables. The EDA practice of joining
is much the same principle. As you’ve learned, joining is different
from the structuring technique merging. While asking these questions or
while performing media in general. You may find that the data you have
been given is not sufficient to answer the business question. Joining is the process of augmenting data
by adding values from other data sets. The practice of joining will be most
useful if we validate the data to ensure formatting and data entries
aligned and are the same data type. For example, while adding the new
vegetable, the chef will want to make sure the consistency and taste are similar
to the vegetables that they’ve been using all day, otherwise
the dish may not turn out the same. You will need to use your own logic,
common sense and experiences to understand the ways
in which you should join or validate each particular
data set you work on. There won’t be a rigid process to show
you down to the detail how to handle each data file. Data science takes a lot
of analytical thinking and shifting of perspectives to be
thorough in your EDA and validation. Experience and effort will be the best way to improve
your performance in analytical thinking. We’ll also be exploring examples
in our Python notebooks as well. Keeping your validation practices in line
with the PACE workflow will also help you keep a focus on ethics. Validation should be about cleaning and
correcting data for the sake of quality and correctness. You probably won’t be surprised to learn
that the EDA practice of validating fits squarely in the analyze phase
of our PACE workflow, but is also a practice you should use for
all four phases. Remember, PACE stands for plan,
analyze, construct and execute. This means that what and how we join and validate data should be in alignment
with the plan phase of the workflow. For example, if your task is to discover
which week within a given month is most profitable for a business,
it will be important that you check and recheck that you’ve grouped the revenue
dates into weeks correctly. Your analysis won’t be effective, if you think you’ve grouped revenue by
week but actually done it by day or month. When you look to validate your
data to this level of detail, it will help you towards
meeting your PACE goals, specifically for
the construct and execute phases. Before we end, there’s one more thing
that can help you with validating and joining aspects of EDA. Look to your peers and
managers to help you. A peer reviewed data set is one of
the best ways to ensure that you check yourself on bias,
keep ethics as a focus and ensure you are on track
with the PACE workflow. Remember, validating data is like checking
your vegetables to find if they’re the right choice for the recipe. It may take extra time up front but
it may save you from headaches and stomachaches down the road.

Video: Input validation with Python

Notes

Tutorial

Transcript

Summary of Input Validation in Python with NOAA Data:

Goal: Validate a NOAA lightning strike dataset for publication using Python.

Key Steps:

Import libraries: pandas, matplotlib, plotly.express, seaborn.
Analyze data:
- Check data types and convert date column to datetime.
- Verify no missing values.
- Review variable ranges and data distribution.
Validate dates:
- Check for missing calendar days.
- Identify and document missing days (4 in June, 2 in Sept, 2 in Dec).
Visualize data:
- Create box plot of number of strikes (skewed distribution).
- Remove outliers to better understand majority distribution.
Validate locations:
- Create DataFrame with unique latitude and longitude points.
- Verify no duplicate data points.
- Plot points on a map using Plotly Express scatter geo plot.

Importance:

Input validation ensures data quality and completeness for reliable analysis and publication.
Mastering these techniques is crucial for data professionals.

Note: This summary highlights the main points but omits some details for brevity.

Here’s a tutorial on Input Validation in Python with NOAA Data:

Introduction

Define input validation as the process of ensuring data quality and completeness.
Highlight its importance for accurate analysis and publication.

Setting Up

Import necessary libraries:

import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

2. Load the NOAA lightning strike dataset:

df = pd.read_csv("noaa_lightning_data.csv")

Data Analysis

Check data types:

df.dtypes

Convert the “date” column to datetime if needed:

df["date"] = pd.to_datetime(df["date"])

2. Identify missing values:

df.isnull().sum()

3. Review variable ranges and distributions:

df.describe(include="all")

Date Validation

Check for missing calendar days:

full_date_range = pd.date_range(start="2018-01-01", end="2018-12-31")
missing_dates = full_date_range.difference(df["date"])
print(missing_dates)

Document any missing dates.

Data Visualization

Create a box plot of the number of strikes:

sns.boxplot(data=df["number_of_strikes"])
plt.show()

Remove outliers for better interpretation:

sns.boxplot(data=df["number_of_strikes"], showfliers=False)
plt.show()

Location Validation

Create a DataFrame with unique latitude and longitude points:

df_points = df[["latitude", "longitude"]].drop_duplicates()

Plot the points on a map using Plotly Express:

fig = px.scatter_geo(df_points, lat="latitude", lon="longitude")
fig.show()

Conclusion

Emphasize the importance of input validation for reliable analysis.
Encourage practicing these techniques for data quality assurance.

In this video, you’ll perform
input validation in Python. Remember, input validation is the practice of
thoroughly analyzing and double-checking to
make sure data is complete, error-free
and high-quality. Just like checking your
vegetables before you cook them. With that knowledge
fresh in your mind, it’s time to discuss how to perform input
validation in Python. We’ll focus our validation
on the NOAA data The goal will be to
check for errors and prepare this dataset
for publication. Let’s open our Python
notebooks and get started. First step, as usual, we will import the
Python libraries and packages that we need
for input validation. You’ll notice we have
pyplot from matplotlib, pandas, plotly.express,
and seaborn. Next, with our lightning
strike data libraries and packages all imported
in our notebook, let’s use the head function
to take a look at what we have in terms of columns
and the top five rows. You’ll find our familiar
lightening strike dataset also includes a fourth
and fifth column, longitude and latitude
respectively. If we were to check the
datatypes now using df.dtypes, we would discover the date
column is a string datatype. As we did previously, let’s convert the
column to datetime. One thing we should include in our input validation is a
check for missing data. You’ll recall from earlier that we likely will have already done this check for missing entries during the cleaning part of EDA. The definition of
validate assumes an action or process has
already been completed, so if we input isnull.sum
into our notebook in order to give us missing
values for the second time. It is a practical and not
redundant thing to do. After running the code, we find that there are
indeed no missing values. Next, for this dataset, let’s review the ranges
of all the variables. In this review,
we will check the highest and the lowest
values of each column, and the overall
distribution of values. To do this, we’ll use
the describe function, which you may recall
from previous video. In the argument
field will use the parameter includes equal, all. This will tell the
notebook to include every parameter covered
by the describe function. You’ll notice NaN
values in some of the describe fields like
unique, last and max. Remember that these are
not fields in the dataset, but rather a summary
table of attributes. For example, you’ll
notice that there are NaNs in the date
column for mean, min, 50 percent, and max. These are to be expected
because dates are not datatypes that
would be averaged. Speaking of dates, let’s
double-check our date column. First, are there any
calendar days missing? We know from prior EDA
work on this dataset that the date column should include
every day of each year. Let’s confirm each
day is listed by designing a calendar index, we’ll call it full date range. We’ll then use the pandas date range function with the
first and last dates of 2018 in the argument field under the start and
end parameters. After that, let’s add
full_date_range.difference. Then in the argument
field we’ll add df, and in the brackets,
the column header date, which will limit
this function to checking only the date column. The result is
interesting. There are four consecutive days
missing in June of 2018, then two consecutive
days in September, and two consecutive
days in December. This finding would
be something to investigate or to question
the owner of the data, to find out the reason why. Given that the number of missing days is
relatively small, you can complete your
analysis by making a note of the missing
days in the presentation. This will ensure that anyone who analyzes your visualization or presentation will
know that the data depicted doesn’t include
those missing dates. Now that we’ve done some
initial validation, let’s get a better
sense of the range of the total numbers
of lightening strikes. If we use a simple box plot
data visualization with number of strikes as our data parameter in
the argument field, we’ll find that the
distribution is a very skewed. Most days of the year have less than five
lightning strikes, but some days have
more than 2,000. As you recall from
a previous video, we can remove the outliers from the boxplot visualization to help us understand where the majority of our
data is distributed. We can do that by
adding show flyers equals false into
the argument field. The result is much
easier to interpret. From what we know of
lightning strikes in the North America regions
of the United States, Mexico, and the Caribbean, this distribution
is conceivable. If the highest distribution of strikes were all in the 2000s, we might be a little
more skeptical. The last bit of input
validation we will do is to verify that all of the latitudes and
longitudes included in this dataset are in
the United States. This will ensure we don’t have any mistakes in our
geographic values. To do that, we’ll first create a new DataFrame that removes
all the duplicate values. We’ll type df_points equals df, then latitude and longitude
in double brackets. Following that, we’ll
add.drop_duplicates. The reason we want to drop duplicate points for the
latitude and longitude is that we don’t need to check a particular location
on a map twice. After running that code, we have a new
DataFrame that only includes two columns,
latitude and longitude. Within those two columns, we find that there are no
duplicate data points. Finally, we will plot
these points on a map. Here’s a tip. When
you’re working with hundreds of thousands of data
points plotted on a map, it takes a lot of
computing power. To make it a shorter runtime, we’ll use the Plotly
express package. The package is designed to keep runtimes as low as possible. To plot these latitude and
longitude points on a map, we use the scatter geo
plot from Plotly Express. Within the argument field, we’ll fill in the
proper parameters for a scatter geo plot. In this case, we’ll first input the DataFrame name df points,
followed by first lat for latitude then lon for longitude. The last code is to
show the scatter plot. After running this cell, you’ll notice the
runtime is still slow because of the sheer
volume of the data. Most data professionals use input validation regularly, so these are important concepts to learn and important technical skills to practice.

Lab: Exemplar: Validate and clean your data

Reading

Exemplar_Validate-and-clean-your-data Download

Practice Quiz: Test your knowledge: Input validation

Data professionals use input validation to ensure data is complete, error-free, and of high-quality.

True

Data professionals use input validation to ensure data is complete, error-free, and of high-quality.

True. Data professionals use input validation to ensure data is complete, error-free, and of high-quality. It helps prevent invalid or incorrect data from entering a system, which can lead to inaccurate results, wasted resources, and even financial losses.

Fill in the blank: If a dataset lacks sufficient information to answer a business question, the process of _____ makes it possible to augment that data by adding values from other datasets.

joining

If a dataset lacks sufficient information to answer a business question, the process of joining makes it possible to augment that data by adding values from other datasets. Joining is most useful if the new data is validated to ensure its format and data entries align and are the same data type as the original dataset.

The correct answer is joining.

Joining combines data from two or more datasets based on shared attributes or relationships. This allows you to leverage information from multiple sources to enrich your existing dataset and potentially answer your business question even if the individual datasets are insufficient.

Summing, blending, and sampling may be useful for data manipulation within a single dataset, but they are not specifically designed for combining data from multiple sources.

In which phase of the PACE workflow would a data professional perform the majority of the data-validation process?

Analyze

A data professional performs the majority of the data-validation process in the Analyze phase of the PACE workflow. However, it’s important to prioritize data-validation throughout all four phases.

In the PACE workflow, data validation would primarily occur in the Analyze phase. Here’s why:

Plan: In the planning phase, you define the scope and objectives of your data analysis project. You might identify potential data sources and challenges, but the actual validation and cleaning of the data happens later.
Construct: This phase involves building the tools and infrastructure needed for data analysis. While some initial data exploration might happen here, the bulk of validation would still occur in Analyze.
Execute: In the execution phase, you run your data analysis tasks and scripts. Data validation is crucial before this step, as invalid data would lead to unreliable results.
Analyze: This is where you dive deep into the data, exploring trends, patterns, and relationships. Before drawing any meaningful conclusions, you must ensure the data’s quality through thorough validation. This includes checking for missing values, inconsistencies, outliers, and other issues.

Therefore, the Analyze phase is where data professionals dedicate the most effort to data validation to prepare the data for reliable analysis and interpretation.

Review: Clean your data

Video: Wrap-up

Notes

Transcript

Cleaning, joining, and validating data are crucial practices in Exploratory Data Analysis (EDA).
We learned to handle missing values and outliers, transform categorical data, and perform input validation using Python.
Visualization plays a critical role in understanding data and revealing hidden stories.
Workplace skills like communication and ethical considerations are essential in data cleaning tasks.
Thorough EDA saves time, ensures data quality, and helps uncover important patterns and trends.
This is a valuable skill for data professionals and a foundation for further data analysis work.

Key takeaways:

EDA is not just about tidying data; it’s about unlocking its potential.
Python is a powerful tool for data manipulation and validation.
Combining technical skills with communication and ethical awareness is crucial in data analysis.

Next: Building stunning data visualizations for effective communication with stakeholders.

As you learn data sets can be busy and
chaotic. There are many different
things happening in them. It can be difficult to figure
out where to focus and how to uncover those stories
buried beneath the surface? In this section of the course, you learn
some important strategies to help you correct mistakes and
discover hidden data stories. We focused our story finding efforts by
learning the EDA practices of cleaning, joining and validating. We discussed a lot of ways to
engage in those practices. We focused on working with
missing values and outliers, transforming categorical into
numerical data and input validation. And we reminded ourselves not to
forget to visualize our data in python to help further our understanding. We discussed identifying missing data and
outliers in python and why it’s important both from an ethical perspective
and a business sense to find them. We considered the difference between
categorical and numerical data and why transformation using
python is important. As for input validation,
you learned what it is? Why it’s important and
how to perform it using python? We also discussed some workplace skills
along the way such as understanding when to communicate with stakeholders or other subject matter experts
about missing values. We also reviewed ethical considerations
you should make when performing cleaning, joining and validating work. You learn why these practices
are vital when performing EDA an a data set as part
of the pace workflow. EDA can be an exciting process,
thorough EDA not only saves time and energy on future processes but
it also helps you find the trends, patterns and stories in the data. Later, you’ll learn how to use tableau to
design and present data visualizations to key stakeholders and
business managers in a business setting. In the meantime, you’ve learned some amazing skills that
data professionals use almost every day. The ability to perform careful EDA on a
dataset in python is an essential building block in a data professionals career. Great job on your work so far.

Reading: Glossary terms from module 3

Terms and definitions from Course 3, Module 3

Categorical data: Data that is divided into a limited number of qualitative groups

Collective outliers: A group of abnormal points, following similar patterns and isolated from the rest of the population

Contextual outliers: Normal data points under certain conditions but become anomalies under most other conditions

Data ethics: Well-founded standards of right and wrong that dictate how data is collected, shared, and used

Data governance: A process for ensuring the formal management of a company’s data assets

Deduplication: The elimination or removal of matching data values in a dataset

Docstring: (Refer to documentation string)

Documentation string: A group of text that explains what a method or function does; also referred to as a “docstring”

Dummy variables: Variables with values of 0 or 1 that indicate the presence or absence of something

Global outliers: Values that are completely different from the overall data group and have no association with any other outliers

Heatmap: A type of data visualization that depicts the magnitude of an instance or set of values based on two colors

Input validation: The practice of thoroughly analyzing and double-checking to make sure data is complete, error-free, and high quality

Joining: The process of augmenting data by adding values from other datasets; one of the six practices of EDA

Label encoding: Data transformation technique where each category is assigned a unique number instead of a qualitative value

Missing data: A data value that is not stored for a variable in the observation of interest

Non-null count: The total number of data entries for a data column that are not blank

One-hot encoding: A data transformation technique that turns one categorical variable into several binary variables

Outliers: Observations that are an abnormal distance from other values or an overall pattern in a data population

Quiz: Module 3 challenge

Which of the following terms are used to describe missing data? Select all that apply.

Blank, NaN, N/A

All of the listed terms can be used to describe missing data in different contexts:

Blank: In some datasets, blank cells or empty fields may indicate missing data.
NaN: “Not a Number” is a common placeholder for missing numerical values in many programming languages and software tools.
Zero: While zeros can represent actual values, they can also be used to flag missing numerical data, especially when it wouldn’t logically be zero.
N/A: “Not Applicable” is a general term indicating that a value is not available or relevant for a specific data point.

Therefore, all four options are valid ways to describe missing data, depending on the specific context and data format.

A data professional at a garden center researches data related to ideal growing climates. As they familiarize themselves with the datasets, they discover some data is missing. Which of the following strategies can help them solve this problem? Select all that apply.

Derive new representative values based on available data.

Create a NaN category

The best strategies for the data professional to address missing data in this scenario are:

Derive new representative values based on available data: This can be done through techniques like mean or median imputation, where missing values are replaced with statistically relevant values calculated from existing data points.
Create a NaN category: NaN (Not a Number) is a standard placeholder for missing numerical values and allows for clear identification and tracking of missing data without affecting other calculations.

The strategies to avoid are:

Change the missing values to Boolean data (true/false): This simplifies the data but risks masking valuable information and introducing bias if not done carefully.
Add in the missing values by taking the average values from the existing data: Simply averaging existing data can distort the overall picture and obscure potential outliers.

Therefore, the correct answers are:

Derive new representative values based on available data.
Create a NaN category.

Remember, the best approach for handling missing data depends on the specific context and the underlying data distribution. Analyzing the data and understanding its meaning is crucial before choosing an appropriate strategy.

A data professional writes the following code:
df.merge(df_zip, how=’left’,
on=[‘date’,’center_point_geom’])
Which of the following is a parameter for the merge?

how=’left’

The correct answer is how=’left’.

In the given code, df.merge() is the function being used to merge two DataFrames, df and df_zip. Parameters provide additional information to functions to control their behavior. Here’s a breakdown of the parameters used in the code:

Parameters:

df_zip: This is the second DataFrame being merged with df. It’s not a parameter of the merge function itself, but an argument passed to it.
how=’left’: This is a parameter that specifies the type of merge to perform. It instructs the function to perform a left merge, keeping all rows from the left DataFrame (df) and only matching rows from the right DataFrame (df_zip).
on=[‘date’,’center_point_geom’]: This parameter indicates the columns to use as keys for joining the DataFrames. It ensures that rows with matching values in these columns are combined.

Non-parameters:

df.merge(): This is the function call itself, not a parameter.
df_joined: This name is not used in the code provided. It could potentially be used to store the result of the merge, but it’s not a parameter of the function.
df.head(): This is a separate function that displays the first few rows of a DataFrame. It’s not involved in the merging process.

What pandas function is used to pull all of the missing values from a data frame?

pd.isnull()

The correct answer is pd.isnull().

Here’s how it works:

Apply pd.isnull() to the DataFrame: This function creates a Boolean DataFrame with the same shape as the original, where True indicates missing values and False indicates non-missing values.
Select desired rows or columns: You can then use this Boolean mask to filter or access the missing values:
- Filter rows: To retrieve all rows containing missing values, use df[df.isnull().any(axis=1)]. This selects rows where at least one element is missing.
- Filter columns: To retrieve all columns with missing values, use df.loc[:, df.isnull().any(axis=0)]. This selects columns with at least one missing value.

Remember that pandas also offers pd.notnull(), which is the inverse of pd.isnull() and identifies non-missing values.

Fill in the blank: Contextual outliers are normal data points under certain conditions but become _____ under most other conditions.

anomalies

The correct answer is anomalies.

Contextual outliers are data points that deviate significantly from the overall pattern of a dataset within a specific context. They might appear normal under certain conditions but become unusual or unexpected when considered in a broader context.

For example, in a dataset of daily temperatures, a value of 35 degrees Fahrenheit might be considered a normal outlier during winter but would be an anomaly in the summer. Similarly, a sudden spike in sales for a particular product during a holiday season might be expected, but the same spike during a regular week would be unusual.

Contextual outliers can often provide valuable insights into hidden patterns or unexpected events within the data. It’s crucial to consider the context when identifying and interpreting outliers to avoid misinterpretation or misleading conclusions.

A data team works for a stereo installation company. To gain insights into what products people are most likely to purchase in the coming year, they review categorical data about 20 of the most popular stereos. Rather than using brand names, they assign a different number to each stereo to make the data simpler to join. What does this scenario describe?

Label encoding

The scenario describes label encoding. Here’s why:

Data is assigned numbers: Brand names are replaced with unique numerical identifiers to simplify data merging and analysis. This avoids issues with string comparisons and simplifies calculations.
Categorical data: The scenario focuses on analyzing different stereo models, which are inherently categorical data (not numerical).
Normalization and data smoothing: These techniques typically deal with continuous numerical data and involve scaling or transforming values to fall within a specific range or smooth out fluctuations. In this case, brand names aren’t being adjusted to fit a certain range or smoothened; they’re simply replaced with distinct labels.
Aggregation: This involves combining or summarizing data points to form a higher-level view. While aggregating the stereo data might be part of the analysis later, the scenario specifically focuses on replacing brand names with numbers, which is the essence of label encoding.

Therefore, the scenario most accurately describes label encoding, a common technique for representing categorical data as numerical values for easy analysis in data wrangling and machine learning tasks.

Fill in the blank: A heat map is a data visualization that displays the magnitude of a set of values using _____ to show the concentration of the values.

colors

The correct answer is colors.

A heat map utilizes a color gradient to represent the magnitude of a set of values. Different shades or intensities of a chosen color palette visually reflect the varying concentrations of the data points. This allows for quick and easy identification of patterns and hotspots, making it a powerful tool for data exploration and analysis.

Vertical bars, markers, and slices are other data visualization techniques, but they wouldn’t accurately describe the essence of a heat map, which relies on variations in color for representation.

What does the pandas function pd.duplicated() return to indicate that a data value does not have a duplicate value within the same dataset?

False

The answer is False.

The pandas function pd.duplicated() returns a Boolean DataFrame where:

True indicates that a particular row is a duplicate of a previous row based on the specified columns.
False indicates that a row is unique and does not have a duplicate within the dataset.

Here’s how it works:

Apply pd.duplicated() to the DataFrame: This creates a Boolean DataFrame with the same shape as the original, marking duplicates as True.
Filter for unique rows: To retrieve only the unique rows, use df[~df.duplicated()], where the ~ operator inverts the Boolean mask to select non-duplicates.

Fill in the blank: The pandas function _____ enables data professionals to create a new dataframe with all duplicate rows removed.

drop_duplicates()

The correct answer is drop_duplicates().

This function is specifically designed to remove duplicate rows from a DataFrame, creating a new DataFrame containing only unique rows. Here’s how it works:

Apply drop_duplicates(): Call df.drop_duplicates() on the DataFrame you want to filter.
Specify columns (optional): To identify duplicates based on specific columns, use the subset parameter: df.drop_duplicates(subset=['column1', 'column2']).
Keep first/last duplicate: By default, the first occurrence of each duplicate is kept. To keep the last occurrence instead, use keep='last'.

For example, df.drop_duplicates(subset=['name'], keep='last') would remove duplicates based on the ‘name’ column, retaining only the last instance of each duplicate name.

Home » Google Career Certificates » Google Advanced Data Analytics Professional Certificate » Go Beyond the Numbers: Translate Data into Insights » Module 3: Clean your data

Module 3: Clean your data

The challenge of missing or duplicate data

Video: Welcome to module 3

Video: Methods for handling missing data

Reading: Data deduplication with Python

Identifying duplicates

Decision time: To drop or not to drop?

Don’t be duped — How to do deduplication

Key Takeaways

Additional Resources

Lab: Annotated follow-along guide: Work with missing data in a Python notebook

Video: Work with missing data in a Python notebook

Video: Remy: A day in the life of a data professional

Lab: Exemplar: Address missing data

Practice Quiz: Test your knowledge: The challenge of missing or duplicate data

The ins and outs of data outliers

Video: Account for outliers

Reading: Protect the people behind the data

We’ve all been there…

The big picture

An ethical mindset

Key Takeaways

Video: Identify and deal with outliers in Python

Reading: Reference guide: How to handle outliers

Practice Quiz: Test your knowledge: The ins and outs of data outliers

Change categorical data to numerical data

Video: Sort numbers versus names

Video: Label encoding in Python

Reading: Other approaches to data transformation

Label encoding

Some potential problems with label encoding

One-hot encoding

Label encoding or one-hot encoding: How to decide?

Key takeaways

Reading: Reference guide: Data cleaning in Python

Practice Quiz: Test your knowledge: Changing categorical data to numerical data

Input validation

Video: The value of input validation

Video: Input validation with Python

Lab: Exemplar: Validate and clean your data

Practice Quiz: Test your knowledge: Input validation

Review: Clean your data

Video: Wrap-up

Reading: Glossary terms from module 3

Quiz: Module 3 challenge

Share this:

Like this: