
Finding stories in data using EDA is all about organizing and interpreting raw data. Python can help you do this quickly and effectively. You’ll learn how to use Python to perform the EDA practices of discovering and sculpting.

Learning Objectives

  • Identify ethical issues that may come up during the data “discovering” practice of EDA
  • Use Python to merge or join data based on defined criteria
  • Use Python to sort and/or filter data
  • Use relevant Python libraries for cleaning raw data
  • Recognize opportunities for creating hypotheses based on raw data
  • Recognize when and how to communicate status updates and questions to key stakeholders
  • Apply Python tools to examine raw data structure and format.
  • Use the PACE workflow to understand whether given data is adequate and applicable to a data science project
  • Differentiate between the common formats of raw data sources (json, tabular, etc.) and data types

“Discovering” is the beginning of an investigation


Video: Welcome to module 2

  • Importance of understanding data: Data professionals should thoroughly understand their data, unlike explaining something they don’t grasp.
  • Course focus: Exploring data through Exploratory Data Analysis (EDA) on realistic datasets.
  • Learning sequence:
    1. Identifying different data types and sources.
    2. Practicing EDA through Python coding:
      • Discovery: Uncovering data properties using Python functions and visualizations.
      • Structuring: Transforming and manipulating data with various Python operations.
    3. Addressing data inconsistencies and questions.
    4. Aligning EDA with a structured workflow.
  • Benefits of comprehensive EDA:
    • Uncovering hidden stories within the data.
    • Forming the foundation for useful visualizations and models.
    • Preventing misinterpretations based on incomplete understanding.

Key takeaways:

  • Treat data exploration as a meticulous investigation.
  • Seek the hidden stories within data sets.
  • Thorough EDA is crucial for reliable data analysis and applications.

Have you ever tried to
explain something to a friend that you didn’t
understand very well? Maybe it was a difficult
math assignment, a complex news story or the way a family member
cooks a favorite dish. When you’re trying
to teach a topic you’re not an expert on, it can be challenging to explain the details and give
clear instructions. As a data professional, you never want to
be in this position with the data you analyze. In fact, your goal should be
to know your data very well. When you’re reviewing
a table of data, it’s important to understand
where the data is from, what the column headers mean, what the data will be used for, what the imperfections are, and the small
details in between. Making sense of raw data
is why we are here. Welcome. I’m excited for you to build on the knowledge
that you’ve learned so far to perform exploratory
data analysis or EDA on the kind of data sets
you’ll see on the job. You’ll start by learning about the many different
data types and data sources that you
will encounter in your work and how to study them. After that, you will return to those Python notebooks to start coding the concepts
you’ve learned. We’ll also go into
more details on the first two practices of EDA: discovering
and structuring. While learning about the
discovering practice, we will use popular
Python functions to get to know the information
contained in the data sets. You’ll learn to use different
visualization techniques to uncover hidden correlations
and connections in the data. For structuring, you will learn to apply
Python functions to large datasets for many different operations
including sorting, filtering, extracting, slicing, joining, and
indexing large datasets. You’ll learn to make
basic corrections or formatting improvements to whole data columns or
entire data sets. All of these techniques and
practices will help you learn about the data and find the story that
needs to be told. Along with all the Python
functions and coding scripts, we will talk about
what to do when questions about the data set
arise that you can’t answer, like why there are missing
data fields, for example. We’ll also make sure the EDA you perform aligns with the PACE
workflow that we have set. At its core this
section of the course is about digging
through a data set for the first time and investigating it as
meticulously as you can. It is up to you to find
the stories in the data. Most data professionals
will tell you that comprehensive EDA is the key to useful visualizations
and models. If there are questions
left unanswered or misunderstanding still present
in the data after EDA, any presentation or
machine learning model based on it will not be
particularly useful. It might feel like
trying to explain a math assignment you
don’t fully understand. Remember, all data sets
have stories to tell. Let’s get to work and find them.

Video: Yaser: Understand data to drive value

Key Points:

  • From Marine to Finance: Yaser, a Nicaraguan immigrant raised in Philadelphia, joined the Marines but discovered a passion for finance through a mentor.
  • Self-Taught Data Savvy: He independently learned advanced data analytics to gain deeper insights for his role as a Senior Financial Analyst at Google.
  • Helping Partners Make Decisions: Yaser works closely with business partners to interpret financial data, providing them with valuable insights to guide their decisions.
  • Conquering Online Courses: He encourages persistence and focusing on daily progress when tackling lengthy online courses, comparing it to running a marathon one mile at a time.
  • Enthusiastic Motivation: Yaser believes in the learning potential of everyone and expresses encouragement for those pursuing online education.

Overall Message:

Yaser’s story showcases the power of mentorship, self-education, and focusing on daily progress to achieve goals, particularly in the field of data-driven finance. He is a positive role model and advocate for online learning, inspiring others to believe in their potential.

[MUSIC]
My name is Yaser and I’m a Senior Financial Analyst at Google. I was born in Nicaragua and I immigrated to the United States
when I was two years old. Grew up in inner city Philadelphia. After high school, I joined the United
States Marine Corps, spent time in Japan, South Africa, Saudi Arabia, and Panama. One of the big reasons I got interested
in finance is because when I was in my first duty station in South Africa, I befriended the ambassador who had
previously worked on Wall Street. He took me under his wing and mentored
me and taught me a lot of things about personal finance, corporate finance, and
the global economy. And my mind shifted from,
I’m going to be a marine forever to, I’m going to get out of the Marine Corps
after my four years and go back to school. And it worked out [LAUGH]. I learned about advanced data
analytics on my own actually between random YouTube channels,
just Googling stuff. I got into big data analytics
partly because to understand and drive value in your organization,
you have to really understand the underlying data as
a senior financial analyst. The typical day in the life really is
focused around a couple of things, working with strategic business partners. I spend a lot of time trying to figure out
what the right financial view is, so they can have the insights into their business
to influence their decision making. Completing an online course like this is
difficult when you look at that course
schedule and see that it’s
stick to your schedule. Show up every day. It’s like running a marathon. You’re not running 26.2 miles,
you’re running one mile, 26.2 times. Just focus on the very next thing. I believe in you, you can do it.

Video: Where the data comes from

Key Points:

  • Cooking analogy: Working with data in EDA is like preparing a meal for the first time. You have a recipe (project plan), ingredients (data), and need to figure out how to best work with them.
  • Data sources: Understanding where data comes from (internal, external) and who maintains it helps assess its reliability and answer questions during analysis.
  • Data formats: Recognizing common formats like CSV, JSON, and databases allows choosing the right tools for manipulation and analysis.
  • Data types: Familiarity with numeric, geographic, demographic, and other types helps interpret what the data actually represents.
  • Workflow alignment: Checking if the available data fits the project plan and if there’s enough for accurate analysis is crucial.
  • Communication: If data is insufficient or unsuitable, reach out to data owners and stakeholders to adjust the plan and ensure everyone stays focused.

Overall message:

Data professionals learn to “cook” with different data formats and types, adapt to challenges like missing ingredients, and communicate effectively to ensure successful analysis and problem-solving.

Remember: Being a data professional is like being a chef – creativity, resourcefulness, and collaboration are key to turning raw ingredients into delicious insights!

Imagine for a moment you’re
preparing a meal for friends. You have a recipe
and raw ingredients, but this is the first time
you’ll cook the dish. Of course, you’ll want
your friends to enjoy it. If you’re like me the idea
of being judged on perfectly preparing a recipe
the first time you’ve made it is scary. Believe it or not
this cooking example is what it’s like to be a data professional during the
discovery practice in EDA. The recipe you’re
trying to follow is equivalent to your
company’s project plan. You have everything you
need at your disposal, a kitchen or a digital workspace with dedicated servers for
data analysis computing, raw ingredients or a dataset and it’s up to you
on how to best mix, blend, and cook the
ingredients in order to make a winning dish that
your friends will enjoy. In this video let’s focus
on raw ingredients. As a data professional
you will be handling all different data in a number of different
formats and file types. I will share with you the
most common data sources, data formats, datatypes and
a few Python functions. Once you’re familiar
with the dataset source, format and the data types
within it you’ll be ready to handle questions or challenges as they arise during
the analysis. First, when you’re given data you’ll want to
know the source. The term data source can have many different meanings
in different contexts, but for our purposes
we’ll be defining data source as the location
where the data originates. One good thing to know about a data source is how and when to contact subject
matter experts such as engineers or database owners. These are the people
who either generate the data or are in charge
of delivering the datasets. When you have questions
about the data as you do your discovery they will be
the ones you reach out to. Knowing the ownership
and source of the data is critical because by understanding the data source
and who is responsible for it you can determine
its reliability. Does the data owner have experience collecting
and storing data? Does the data owner have any financial stake
in the data’s output? Understanding the source of the data will help
you in telling the story of the data and make ethical decisions
about its use. Another important
part of determining the data source is understanding
how it’s collected. Whether the data was
collected through a report from a computer system, a custom selection from
a large online database, or a data table that has
been manually entered, knowing about how the data
was gathered will help you understand questions
that may come up during EDA. For example, missing values could mean many
different things. Maybe the database owners either didn’t know or didn’t
want to disclose the data for manual
entry, or there might be lagging data or a system bug
from an online database. The next thing you’ll
need to know about your data sources is
the data file format. The main data formats
you’ll experience as a data professional are tabular
files, XML files, CSV or comma-separated
value files, spreadsheets, database files or JSON files which stands for JavaScript
Object Notation. Here are few examples of what
these file types look like. If you’ve gotten this
far into the program, you’ll be familiar with tabular
and Excel files by now. As you know they organize
data in tables with data variables organized
by rows and columns. Rows representing the objects, and columns representing aspects of the objects in the dataset. The advantage of
this file type is a clear identification of
patterns between variables. CSV files are simple
text files which can be easy to import or
store in other software, platforms, and databases. They look like rows
of text and numbers separated by commas
or other separators. The rows of data
are broken up by commas rather than
strict columns. The advantage of CSV files is more from a computer
science aspect in that it is really a file type which is easily
read even in a text editor. It is also easy to
create and manipulate. In Python, you can use the
read_csv function to read and work with tabular data that is
in the CSV format. Database or DB for short is another way of storing
data often in tables, indexes, or fields. Database files are great
for searching and storage. They often require some
basic knowledge of Structured Query Language or SQL. We will explore how
to query data from databases in an
upcoming section. Lastly, JSON files are data storage files that are
saved in a JavaScript format. You’ll find the
information in these files better resembles Python code, but with different language,
function and format. JSON files may contain
nested objects within them. You can think of
nested objects as expandable file folders or drop-down menus within
the code itself. For example a JSON
file might have the ingredients for a recipe listed and under
each ingredient, you’d have included nested
information like weight, calories, and price as objects that define
the ingredients. There are a few
advantages of JSON files. They have a small message size. There are readable by almost
any programming language, and they help coders easily distinguish between
strings and numbers. There are Python libraries and functions built for
working with JSON files. As you learned previously, you can import the JSON
Python module into your Python notebook to use as an encoder and decoder
of JSON files. There are also tools
in pandas called read_json and to_json, which will translate
JSON files and convert an object to JSON format
type, respectively. As a data professional, you may also be tasked to find data stories in other formats, like HTML, audio files, photos, email, text
messages, or text files. In every case, there are Python libraries or
functions that can help you discover and structure the data for your
research project. There is no best
format for the data. It’ll depend on the
project and storage type. You’re just looking
for the format that best fits that
particular dataset. Finally, the last raw ingredient in understanding data
is types of data. You may already be familiar with the many different
categories of data. As a reminder, there is first-, second-, and third-party data. First-party data
is data that was gathered from inside
your own organization. Second party data is data
that was gathered outside your organization but directly
from the original source. Finally, there’s
third party data, which is data gathered outside your organization
and aggregated. Knowing the types of data, whether first, second,
or third party, will help you be able to
more efficiently answer or seek answers to
the questions you may have that arise
during analysis. For example if there are missing values in first-party data, then someone in your
organization can help you determine whether the
missing data can be recovered. For third-party data,
you will likely have to reach out to a
separate organization. You’ll also be familiar
with different types of data like geographic, demographic, numeric, time-based, financial,
and qualitative data. Your job as a data
professional requires that you understand and work with
all these types of data. Knowing the data source, format and type you’re
working with will help you to answer two
very important questions. First, given what you
know of the data so far, does it align with the plan as defined by your
PACE workflow? Second, do you have enough data to follow through with the
plan in the PACE workflow? If you’ve answered
either these questions with a no or not sure, it is your job to reach out
to the owners of the data and project stakeholders to inform them of the
issue you found. For example, let’s say
you’re assigned to predict the number of customers a retailer will expect
to see next month. Unfortunately, you’re only given profit margin data and only two months’ worth of
customer purchase data. Because the profit margin data won’t help you with returning customers and only two months of data won’t give you much
confidence in your prediction, you should go back
to the data source. You’ll need customer
purchase data from the last few years to accurately predict
an upcoming month. Returning to the data
source and requesting more data will keep
everyone in the process, including you, focused on the plan that was established as part of
your PACE workflow. When you’re working as
a data professional, keeping this focus will be
essential in identifying and carrying out high priority
and value-add tasks. As a data professional, you’re expected to know what the data means and
how to work with it to find solutions to business problems.
Like cooking a meal, you’ll be given the tools and ingredients you need to
work on the problem. It is up to you to work with
your team if there aren’t enough ingredients or if what you’ve been given won’t
make a great dish.
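
The transcript above mentions pandas functions for reading several of these formats. Here is a minimal sketch of what those calls look like; the file names are hypothetical placeholders, and read_excel assumes the openpyxl engine is installed.

Python

import pandas as pd

# Hypothetical file names, for illustration only.
df_tabular = pd.read_csv('lightning_strikes.csv')    # CSV / tabular data
df_nested = pd.read_json('lightning_strikes.json')   # JSON data, possibly with nested objects
df_sheet = pd.read_excel('lightning_strikes.xlsx')   # spreadsheet data

# Convert a DataFrame back into a JSON string.
json_string = df_tabular.to_json()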

Fill in the blank: _____ is data gathered from inside your own organization.

First-party

First-party data is data gathered from inside your own organization.

Reading: Reference guide: Import datasets with Python

Reading: Reference guide: Pandas methods for the discovery of a dataset

Python reference guide for EDA: Discovering

Lab: Annotated follow-along resource: EDA using basic functions with Python

Video: EDA using basic data functions with Python

Dataset: National Oceanic and Atmospheric Administration (NOAA) lightning strikes in North America for 2018.

Goal: Perform exploratory data analysis (EDA) to predict future lightning strikes.

Steps:

  1. Import libraries and data: pandas, NumPy, matplotlib.pyplot, and datetime (abbreviated as pd, np, plt, and dt).
  2. Convert date column to datetime: Allows grouping by time segments.
  3. Inspect data: Use head to view first 10 rows, analyze column headers, understand data types (info).
  4. Prepare data for visualization: Group daily strikes by month using dt.month and create month abbreviations.
  5. Create and plot visualization: Group and sum strikes by month, plot bar chart with plt.bar.

Findings:

  • August 2018 had the most lightning strikes.
  • November and December 2018 had the least strikes.

Next steps:

  • Explore further visualizations, e.g., map of strikes by state.
  • Analyze factors influencing strike patterns.
  • Build predictive models for future lightning strikes.

Learning objectives achieved:

  • Understand the importance of EDA for data analysis.
  • Apply Python tools to examine raw data structure and format.
  • Recognize opportunities for creating hypotheses based on raw data.

This summary captures the key aspects of the EDA discovering practice, highlighting the process, tools used, findings, and future directions. It demonstrates the application of EDA in a real-world scenario.

Introduction:

  • Exploratory Data Analysis (EDA): The process of examining a dataset to understand its structure, summary statistics, and patterns.
  • Python Libraries: We’ll use Pandas for data manipulation and NumPy for numerical operations.
  • Goal: To gain insights and guide further analysis or modeling.

1. Import Necessary Libraries:

Python

import pandas as pd
import numpy as np

2. Load the Dataset:

Python

df = pd.read_csv('your_data_file.csv')  # Adjust for different file types

3. Initial Inspection:

Python

# View first 5 rows
print(df.head())

# Get information about the dataset
print(df.info())

# Summary statistics
print(df.describe())

4. Handling Missing Values:

Python

# Check for missing values
print(df.isnull().sum())

# Handle missing values (e.g., fill with mean, drop rows, or use imputation techniques)
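# For example (illustrative only; 'column1' is assumed to be a numeric column):
df_no_missing = df.dropna()                                    # deletion: drop rows with any missing value
df['column1'] = df['column1'].fillna(df['column1'].mean())     # imputation: fill gaps with the column mean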

5. Data Cleaning:

  • Fix inconsistencies (e.g., typos, formatting issues); see the sketch after this list.
  • Normalize or standardize data if needed.
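
A minimal sketch of these cleaning steps, assuming a hypothetical text column 'column5' alongside the numeric 'column1' used elsewhere in this guide:

Python

# Trim whitespace and standardize case in a text column.
df['column5'] = df['column5'].str.strip().str.lower()

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Standardize a numeric column (z-score), if the analysis calls for it.
df['column1_std'] = (df['column1'] - df['column1'].mean()) / df['column1'].std()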

6. Data Manipulation:

Python

# Select specific columns
selected_columns = ['column1', 'column3']
df_subset = df[selected_columns]

# Filter rows based on conditions
filtered_df = df[df['column2'] > 50]

# Create new columns
df['new_column'] = df['column4'] * 2

# Rename columns
df = df.rename(columns={'old_name': 'new_name'})

7. Data Visualization:

Python

import matplotlib.pyplot as plt

# Histogram for numerical data
plt.hist(df['column1'])
plt.show()

# Bar chart for categorical data
counts = df['column2'].value_counts()
plt.bar(counts.index, counts.values)
plt.show()

# Scatter plot for relationships
plt.scatter(df['column3'], df['column4'])
plt.show()

8. Further Exploration:

  • Explore relationships between variables using correlation matrices or scatter plots (see the sketch after this list).
  • Identify outliers and anomalies.
  • Consider dimensionality reduction techniques if needed.
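
A minimal sketch of the first two bullets above, reusing the hypothetical df and 'column1' from the earlier steps:

Python

# Correlation matrix for the numeric columns only.
print(df.select_dtypes(include='number').corr())

# Flag potential outliers in a numeric column using the 1.5 * IQR rule.
q1 = df['column1'].quantile(0.25)
q3 = df['column1'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['column1'] < q1 - 1.5 * iqr) | (df['column1'] > q3 + 1.5 * iqr)]
print(outliers)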

9. Document Findings:

  • Summarize key insights and patterns.
  • Highlight any potential issues or biases in the data.

Remember: EDA is an iterative process. Explore different visualizations and techniques to uncover valuable insights from your data.

Why does a data professional use the Python methods describe(), sample(), size, and shape?

To learn about a dataset

A data professional uses the Python methods describe(), sample(), size, and shape to learn about a dataset.
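
For reference, here is a minimal sketch of how those methods and attributes are typically called on a pandas DataFrame named df (the DataFrame name is an assumption for illustration):

Python

df.describe()   # summary statistics for the numeric columns
df.sample(5)    # a random sample of 5 rows
df.size         # total number of values (rows x columns)
df.shape        # (number of rows, number of columns)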

Now that you have a
better understanding of exploratory data analysis
and why it’s important, let’s do some discovering
on a dataset similar to one you might work with in your career as a
data professional. The National Oceanic and
Atmospheric Administration, or NOAA, keeps a daily record of lightning strikes across
much of North America. Let’s imagine we’ve been
tasked with performing EDA on this dataset so that
it can be used to predict future lightning
strikes in this region. In this video, we will use Python to do the EDA practice of discovering on data gathered
by the NOAA in 2018. We will talk through the first steps data professionals would typically do when encountering a dataset for the first time. Let’s open up a Jupyter
Python notebook to begin. First, we want to import
the dataset we wish to analyze and the applicable python
libraries we want to use. This preparation
step is similar to the way a painter
gathers paint brushes, paints, and an easel before
they start painting. Similarly, we want to collect
all our tools and data. The NOAA dataset we will use
is in the public domain, but we have provided a file to download so
you can follow along. To begin, import the Python libraries and
packages you plan to use. In our case, we’ll use pandas, NumPy, matplotlib.pyplot, and datetime. To save time and
increase efficiency, rename each library or package with a
short abbreviation: pd, np, plt, and dt, respectively. Another quick tip
for saving time. Instead of clicking Run cell, you can always use the
keyboard shortcut, Control or Command Enter. You may recall that
Python packages like Pandas and NumPy are open source code packages that
have been pre-designed and coded to help analyze and manipulate data
more efficiently. Pyplot is a package focused on plotting charts
and visualizations. Most data professionals import all their commonly
used libraries, packages, and modules at the beginning of
any coding session. But you can import them at any time while working with
data in a Python notebook. The pandas conversion
function to_datetime is incredibly helpful when working
with dates and times. We will first convert our
date column to datetime, which will split up
the date objects into separate data points of
year, month, and day. This is good practice because
throughout your career, you’ll be grouping data by various time segments.
We’ll get to that later. After running these
initial functions, you are ready to begin your EDA discovering practice
with this dataset. Let’s begin with the
head method or function. Head will return as many rows of data as you input in
the argument field. We’ll look first at
the top ten rows. When I input head with a ten in the argument of
field and click “Run”, the first ten rows of the dataset are now
in our notebook. This dataset contains
three columns of data, date, number of strikes, and center point geom. At this point in the exercise, we will need to
clearly understand what each of these
columns means. Date and number of strikes
are fairly straightforward. You’ll want to pay attention to any dates and their format, which in this case is
year, month, day. You’ll also find when
looking at the date column, that lightning strike data
is recorded almost every day for at least one
location in the year 2018. Some column headers are obvious, but some like the column header, center point geom aren’t. If you are not sure about what the column headers mean while you’re working
on a project, you could go back to the
public documentation available from the NOAA to
confirm its meaning. For this video, the
information is provided. The center point geom
column refers to the longitude and latitude
of the recorded strikes. We know the number of columns, but we also want to know how many rows and data
we’ll be working with. Another great EDA discovering
tool is the info function. The method called Info gives the total number and datatypes
of individual entries. Keep in mind that datatypes
are called dtypes in pandas. If we type df.info() into
our notebook and run it, we will get this output. Our range index tells
us our total number of entries, nearly 3.5 million. We also find that the data
values in our date and center_point_geom columns
are classified as objects, and the data in our number_of_strikes data
column is classified as int64. Int64 refers to integer64. It means that the datatype contains integers
or numbers between negative two to the power of 63
and two to the power of 63. Simply put, int64 is a
standard integer somewhere between negative 9 quintillion and positive 9 quintillion. There’s one more Dtype you might see when performing
the info function, str, which refers to a string. Like objects, you’ll likely be familiar with
strings already. Strings are sequences
of characters or integers that
are unchangeable. Getting back to the
dataset we have a pretty good idea about
its size and scope. We know it has three
columns and we know it has 3,401,012 rows. We know what the columns mean. There are other methods
or functions that we could use during our
practice for discovering. For example, the
methods describe, sample, size, and shape are all useful for
learning about a dataset. For this dataset though, we have what we need in order
to understand what is going on in the dataset from
a discovery standpoint. Next, let’s determine
which months have the most lightning strikes by plotting our
first visualization. Remember, your dataset has
over three million rows. If we don’t do any
categorizing or grouping, we can end up with
a notebook that is stuck endlessly trying
to run the code. We’ve already converted our
date column to datetime, which makes it easy for us to manipulate the
data in the column. We want to group our daily strikes into
something more manageable, like months, for example. Let’s make our date reference the number each month
corresponds to in a year. That is, January is one, February is two, and so on. We’ll do that by creating
the column month using the code df['date'].dt.month. dt in this instance
stands for datetime. Next, we’ll make it easier
to interpret on a chart. Let’s convert the month data, which are currently numbers
to month name abbreviations. We’ll do this by
creating another column, month underscore txt. This code will take the
string of characters or month names and slice to include only the
first three letters. Before we try to plot
this data on a chart, we need to sum the numbers of strikes in all
locations by month. We can do this by
creating a DataFrame using the groupby
function in which we will include the columns
month and month_txt because these are
the columns we will use to group and order
the lightning strikes. We’ll use the sum
function to add the lightning strikes
for each month together. For our last lines of
code in this video, let’s plot the number of strikes by month for the year 2018 into a bar chart and then total number of
strikes by state on a map. This will help us get a good
sense of the data story. You’ll find that this looks
like a big block of code, but because we are using
the matplotlib.pyplot, the actual code for the visualization is not
too difficult to follow. First, we have the plt.bar
function which will have us enter the data
columns we wish to plot with x-axis first, month_txt, followed by height or y-axis, which we will fill
in with number_of_strikes. Next, we will give the
bars a legend or label, as it’s called in Python, which we will input
as number of strikes. Lastly, we’ll fill in our
bar graph details with the x and y-axis labels and
the visualization title. Leave the legend and show
argument fields blank for now. After we run the
code you’ll see that August 2018 had the
most lightning strikes, while November and December 2018 had the least amount of strikes. Now you’ve completed your
first experience coding for an EDA practice in Python. Very well done. We’ve a lot more to talk about
and this is a great start.
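
Pulling the steps from this video together, here is a minimal sketch. The file name is a hypothetical placeholder, while the column names date and number_of_strikes come from the transcript:

Python

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; use the download provided with the course.
df = pd.read_csv('noaa_lightning_strikes_2018.csv')

# Convert the date column to datetime so it can be grouped by time segments.
df['date'] = pd.to_datetime(df['date'])

# Discovering: inspect the first ten rows and the overall structure.
print(df.head(10))
df.info()

# Group daily strikes by month and sum them.
df['month'] = df['date'].dt.month
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)
df_by_month = (df.groupby(['month', 'month_txt'])['number_of_strikes']
                 .sum()
                 .reset_index()
                 .sort_values('month'))

# Plot total strikes per month as a bar chart.
plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'],
        label='Number of strikes')
plt.xlabel('Month')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes in 2018 by month')
plt.legend()
plt.show()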

Lab: Exemplar: Discover what is in your dataset

Practice Quiz: Test your knowledge: Discovering is the beginning of an investigation

Fill in the blank: Tabular, XML, CSV, and JSON files are all types of _____.

It is a data professional’s responsibility to understand data sources because the data’s origin affects its reliability.

Which Python method returns the total number of entries and the data types of individual data entries in a dataset?

Understand data format


Video: Discover what is missing from your dataset

  • The process of Exploratory Data Analysis (EDA) requires a shift in perspective, similar to taking a break from a challenging project to gain new insights.
  • After performing an initial data discovery, you can form a hypothesis based on the data. A hypothesis is a theory or explanation that is not yet proven true.
  • Hypotheses are used as a starting point for further investigation or testing.
  • Asking meaningful questions and forming hypotheses helps you better understand what you want to learn from the data and guides your analysis.
  • You can break down the original problem into smaller chunks to ask more specific questions.
  • Having a plan and contacting subject matter experts or doing your own research can help you answer your questions and test your hypothesis.
  • Making changes to the data, such as organizing or altering it, may be necessary to find the answers to your questions.
  • Stay focused on the problem you’re assigned or the plan you established as part of the PACE framework.
  • Answering questions and testing hypotheses will help you uncover the stories hidden in your data.

Understanding Data Gaps:

  • Every dataset has limitations. Identifying these gaps is crucial for accurate analysis and drawing reliable conclusions.
  • Missing information can lead to biased results and faulty decision-making.

Key Steps:

  1. Exploratory Data Analysis (EDA):
    • Start with descriptive statistics:
      • Examine data types, ranges, distributions, and missing values.
      • Visualize distributions and relationships using histograms, scatter plots, box plots, and correlation matrices.
    • Identify patterns and anomalies:
      • Spot inconsistencies or unexpected data points that might signal missing information.
  2. Identify Missing Values:
    • Check for explicit missing values:
      • Represented as “NaN,” “NA,” blanks, or other placeholders.
    • Look for implicit missing values:
      • Values that are technically present but might be invalid or uninformative (e.g., “999” used to represent unknown ages).
  3. Understanding Patterns of Missingness:
    • Random missingness: Data is missing randomly, unaffected by other variables.
    • Non-random missingness: Data is missing systematically, related to other variables.
      • Analyze patterns using visualizations and statistical tests.
  4. Understanding Reasons for Missingness:
    • Data collection errors: Mistakes during data entry or recording.
    • Participant non-response: Individuals providing incomplete information.
    • Data cleaning errors: Accidental removal of data during cleaning processes.
    • Measurement limitations: Inability to collect certain data due to practical constraints.
  5. Strategies for Handling Missing Data (see the sketch after this list):
    • Deletion: Remove cases with missing values (risk bias if non-random).
    • Imputation: Replace missing values with estimates (mean, median, mode, or predicted values).
    • Modeling: Incorporate missingness into the analysis model (e.g., using algorithms that handle missing data).
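
A minimal sketch of the deletion and imputation strategies listed above, using a small hypothetical DataFrame:

Python

import pandas as pd
import numpy as np

# Hypothetical data with missing values, for illustration only.
df = pd.DataFrame({'age': [25, np.nan, 31], 'city': ['Denver', 'Austin', None]})

# Deletion: drop rows that contain any missing value.
df_dropped = df.dropna()

# Imputation: fill numeric gaps with the mean and categorical gaps with the mode.
df_imputed = df.copy()
df_imputed['age'] = df_imputed['age'].fillna(df_imputed['age'].mean())
df_imputed['city'] = df_imputed['city'].fillna(df_imputed['city'].mode()[0])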

Additional Considerations:

  • Document missing data decisions: Record rationale for handling choices for transparency and reproducibility.
  • Consider sensitivity analysis: Assess how different missing data approaches affect results.
  • Consult subject matter experts: Seek insights from domain experts to understand potential implications of missing data.

Remember:

  • Missing data is a common challenge in data analysis.
  • Understanding its nature and patterns is essential for making informed decisions.
  • Choose appropriate strategies based on the characteristics of your dataset and the goals of your analysis.

I’ve often found that stepping away
from a challenging project and taking a break helps me find
new perspectives on my work. I like to get a can of seltzer
from the fridge, take a walk and reflect on what I know and what I don’t
know about the project I’m working on. The EDA discovering process requires
a similar shift in perspective. After I perform an initial discovering
of a data set, I’ve usually learned enough about the project to make
a hypothesis based on that data. A hypothesis is a theory or an explanation based on evidence
that is not yet proved true. Data professionals often use
hypotheses as a starting point for continued investigation or testing. Once I form my hypothesis,
I’m in a better position to discover more about the data and
achieve my ultimate goal: telling a story. So far in this course we’ve discussed how
to begin the discovering practice of EDA. You learned how to examine data sources,
data formats and data types. You’ve considered column header
information and averages and you’ve made some initial
visualizations to represent your data. You’ve used Python to determine the size and
scope of the data set and learned when you need to ask the owner
of the data clarifying questions. After you’ve made sense of the raw data,
you’re ready for the next step of the discovering process, drafting a list
of questions and forming hypotheses. In this video you’ll learn how to ask
meaningful questions about the goal you’ve outlined in the PACE workflow
to better understand what is missing from your data set and
what you still need to find out. One way I like to do this is by breaking
the original problem into smaller chunks. Some questions you might ask include how
can I break this data into smaller groups so I can understand it better? How can I prove my hypothesis or,
in its current form, can this data give me the answers I need? Let’s consider these questions in context. For example, imagine you work as a data
analyst for an international airline and you need to determine whether lowering
the prices of tickets will attract more customers on certain days of the week. To solve this problem, you might ask which
months have the most passenger traffic, which weeks, dates or known holidays have
the highest number of passengers? When are tickets typically purchased? Then you form your hypothesis. In this case your hypothesis might be: I predict that Tuesdays and Wednesdays of
a normal business week have the fewest number of passengers and flights. So if the airline lowers prices for
Tuesdays and Wednesdays during or on holiday weeks
then they will sell more tickets. Eventually, you would test your hypothesis
by analyzing the data to understand whether the airline would attract more
customers by lowering the prices on those specific flights. The purpose of asking questions and forming a hypothesis is to
better understand what you want to learn from the data and what
the results of your testing might show. Later when you’re performing other
practices of EDA, you can refer to these questions and your hypothesis to
determine whether you’ve supported or refuted your original theory. To answer the questions and
test the hypothesis you or your team formed you will need a plan. For instance, you might need to contact the subject
matter expert who owns the database or is more familiar with the data source or
you may need to do your own research. In other words leave no stone unturned. It’s an old saying about
an ancient Greek legend and it means to search everything you can to
think of to find what you’re looking for. If you discover that your search for
answers only brings more questions, that’s a good thing! You’re eliminating
the possibility of misinterpreting or misrepresenting the data
each time you learn more. At some point you may need to
decide you need to organize or alter the data to find
the answers to your questions. For example you may need to
regroup entries into months or years rather than days or weeks or
you might want to group customer ages into age ranges to help you
understand trends more effectively. Sometimes combining or splitting
data columns will be necessary for creating models to answer questions. Other times changing date formats or time zones in time bound data
may be all that’s required. For example in your work with
the international airline you were tasked with finding days to enact
lower ticket prices so the airline could
attract new passengers. Imagine the data you were given listed
ticket prices in US dollars, but the original request was
to lower ticket prices for passengers departing from Europe. One change you would need to make
immediately is to convert US dollars to euros. Making small changes to your data, like
formatting the time, changing a unit or converting the currency is all
part of the discovering process. However, with every change you make stay
focused on the problem you’re assigned or the plan that you established
as part of the PACE framework. As we discussed,
every data set is different. Asking questions and
forming a hypothesis will take time and effort but
ultimately answering questions and testing your hypotheses will be the way
you find the stories hidden in your data. If you get stuck, it might help to step
away from your initial discovering work and think through your questions and
hypotheses again. Any visualization rendered, conversion
made, questions answered or hypothesis tested must be
true to that data set’s story. Who knows? Maybe you’ll find the answer on your walk
break and wouldn’t that be refreshing.

Reading: Reference guide: Datetime manipulation


Lab: Annotated follow-along guide: Date string manipulations with Python

Video: Date string manipulations with Python

Key Points:

  • Manipulating Date Strings:
    • Convert date strings to datetime objects using pd.to_datetime().
    • Create new columns for different date groupings (e.g., week, month, quarter, year) using strftime().
  • Grouping and Visualizing Data:
    • Group data by desired time periods using groupby().
    • Create bar charts using plt.bar() to visualize data distribution across time periods.
    • Customize charts with labels, titles, formatting, and colors for clarity.
  • Example with Lightning Strike Data:
    • Grouped lightning strikes by week and quarter to analyze patterns.
    • Created bar charts to visualize strike frequency for different time periods.

Additional Insights:

  • Consider the purpose of your analysis when choosing time groupings.
  • Break down complex visualizations into smaller, more digestible charts.
  • Use clear labels and formatting to enhance chart readability.
  • Explore different chart types (e.g., line charts, scatter plots) to visualize trends and relationships.

Understanding Datetimes in Python:

  • Python’s datetime module provides tools for working with dates and times.
  • Key data types:
    • datetime: Represents a specific date and time (e.g., 2024-01-04 09:46)
    • date: Represents a date without time (e.g., 2024-01-04)
    • time: Represents a time without a date (e.g., 09:46:32)

Converting Strings to Datetimes:

  • Use pd.to_datetime() to convert strings to datetime objects:

Python

import pandas as pd

date_strings = ["2023-11-21", "2024-01-03", "2024-02-15"]
datetimes = pd.to_datetime(date_strings)

Extracting Components:

  • Access individual components of datetime objects:

Python

for dt in datetimes:
    print("Year:", dt.year)
    print("Month:", dt.month)
    print("Day:", dt.day)
    print("Hour:", dt.hour)
    print("Minute:", dt.minute)
    print("Second:", dt.second)

Formatting Date Strings:

  • Use strftime() to format datetime objects into custom strings:

Python

formatted_dates = datetimes.strftime("%Y-%m-%d")
# Output: 2023-11-21, 2024-01-03, 2024-02-15

Common Format Codes (combined in the example after this list):

  • %Y: Year with century as a decimal number (e.g., 2024)
  • %m: Month as a zero-padded decimal number (e.g., 01)
  • %d: Day of the month as a zero-padded decimal number (e.g., 03)
  • %H: Hour (24-hour clock) as a zero-padded decimal number (e.g., 14)
  • %M: Minute as a zero-padded decimal number (e.g., 46)
  • %S: Second as a zero-padded decimal number (e.g., 32)
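
Combining several of these codes, a small illustration that reuses the datetimes object created above:

Python

timestamps = datetimes.strftime("%Y-%m-%d %H:%M:%S")
# First entry: 2023-11-21 00:00:00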

Creating Date Ranges:

  • Use pd.date_range() to generate a sequence of dates:

Python

dates_range = pd.date_range(start="2023-12-01", end="2024-01-31")

Timedeltas for Date Arithmetic:

  • Use pd.Timedelta() to represent time differences:

Python

one_week = pd.Timedelta(days=7)
next_week = datetimes + one_week

Grouping and Analyzing Data:

  • Combine date manipulations with other Python data analysis tools (like Pandas) for powerful insights.

In the statement df['date'].dt.strftime('%Y-W%V'), which element states that the year should be included in the new column format?

%Y

In the statement df[‘date’].dt.strftime(‘%Y-W%V’), the element %Y states that the year should be included in the new column format.

As a data professional, you can expect
to work with date time objects and date strings. In this video,
we’ll continue coding in Python and practice converting,
manipulating, and grouping data. By the end of this video, we’ll create
a widely used data visualization, a bar graph that tells
a story with your data. Working with date strings will often
require breaking them down into smaller pieces. Breaking date strings into days, months,
and years allows you to group and order the other data in different ways so
that you can analyze it. Manipulating date and time strings
is a foundational skill in EDA. In this video,
you will learn to convert date strings in the NOAA lightning strike
dataset into datetime objects. We will discuss how to combine
these data objects into different groups by segments of
time such as quarters and weeks. Let’s open a Python notebook,
and I’ll show you what I mean. Let’s begin by importing
Python libraries and packages. To start, import Matplotlib and
Pandas which you’ve used before. To review,
Pyplot is a very helpful package for creating visualizations like bar,
line, and pie charts. Pandas is one of the more popular packages
in data science because it’s specific focus is a series of functions and
commands that help you work with datasets. The last package, Seaborn,
may be new to you. Seaborn is a visualization
library that is easier to use and produces nicer looking charts. For this video, we’ll use the NOAA
lightning strike data for the years 2016, 2017, and 2018 to group lightning
strikes by different timeframes. This will help us understand total
lightning strikes by week and quarter. As I mentioned at the beginning of this
video, when manipulating date strings, the best thing to do is to break
down the date information like day, month, and year into parts. This allows us to group the data into
any time series grouping we want. Luckily, there’s an easy way to do that
which is to create a datetime object. Now, as you’ll recall
from a previous video, this NOAA dataset has three
columns giving us the date, number of lightning strikes, and
latitude and longitude of the strike. For us to manipulate the date column and
its data, we’ll first need to convert
it into a datetime data type. We do that by simply
coding df['date'] and making that equal to pd.to_datetime() with df['date'] input in parentheses. Doing this conversion
gives us the quickest and most direct path to manipulating
the date string in the date column which currently is in the format of
a four digit year followed by a dash, then the two digit month, a dash,
and lastly the two digit day. Okay, this is the exciting part. Because our dates are converted
into Panda’s datetime objects, we can create any sort of
date grouping we want. Let’s say for example, we want to group
the lightning strike data by both week and quarter. All we would need to do is
create some new columns. You’ll see here we’re creating four new
columns, week, month, quarter, and year. With the first line of code,
we’re creating a column week by taking
the data in the date column and using the function strftime. This function from
the datetime package formats our datetime data into new strings. In this case, we want the year followed by
a dash, and the number of weeks out of 52. If we want that string, we need to code it as %Y-W%V. The percent sign is the command which
tells the datetime format to use the year data in the string. The W implies this is a week,
and the V stands for value, as in a sequential
number running from 1 to 52. The final string output for
the column data will be in this format, 2016-W27. The next line of code gives
us the new month column. The argument is then written as a %Y-%m. This will output the four
digit year followed by a dash, then the two digit month. Essentially, we’re removing the last two
digit date from the original date string. Next, we will create a column for
quarters. In this case, a quarter is three months. Many corporations divide their
financial year into quarters. So knowing how to divide data into
quarter years is a very useful skill. In this case,
it only takes one line of code. We’ll call the new column quarter,
and we’ll use our date column with the to_period
method to create the quarter column. The datetime package has a pre-made code
for dividing datetime into quarters. In the to_period argument
field, we only need to place the letter Q. After that, we can use the function
strftime to complete the string. For the argument, we put %Y-Q%q. The first Q is placed into the string to
indicate we are talking about quarters. The percent sign followed by
the lowercase q indicates to pandas that we want the date
formatted into quarters. Our final column will be
the easiest to code of them all. The year column is created by taking
our original date column data and creating a string that includes
only the argument %Y. This creates a column of data
with only the year in it. Now that we have formatted some strings,
let’s quickly review our work by using the head function we learned
in the previous video. When we run this code,
our four new columns are there, week, month, quarter, and year. They are all formatted
just as we discussed. We can use these new strings
to learn more about the data. For example, let’s say we want to group
the number of lightning strikes by weeks. An organization whose employees primarily
work outdoors might be interested in knowing the week to week likelihood
of dealing with lightning strikes. In order to do that,
we’ll want to plot a chart. We’ve reviewed a couple of
charts coded in Python by now. Next, let’s code a chart with
a lightning strike data. For plotting the number of lightning
strikes per week, let’s use a bar chart. Our graph would be a bit confusing
using all three years of data. So, let’s just use the 2018 data and limit our chart to 52 weeks
rather than 156 weeks. We can do this by creating a column
that groups the data by year and then orders it by week. We will learn more about
structuring functions in another video. For now,
let’s focus on plotting this bar chart. We’ll use the plt.bar function to plot. Within our argument field, we select
the x-axis which is our week column, then the y-axis or height,
which we input as a number of strikes. Next, we’ll fill in some of
the details of our chart. Using plt.plot,
we will place arguments in the x-label, y-label, and title functions. The arguments are week number,
number of lightning strikes, and number of lightning strikes per week
(2018), respectively. This renders a graph, but
the x-axis labels are all joined together. So, we have a chart, but
the x-axis is difficult to read. So let’s fix that. We can do that with
the plt.xticks function. For the rotation, we can put 45, and for
the fontsize, let’s scale it down to 8. After we use plt.show,
the x-axis labels are much cleaner. Given our bar chart illustrating
lightning strikes per week in 2018, you could conclude that a group
planning outdoor activities for weeks 32 to 34 might want
a backup plan to move indoors. Of course, this is a broad
generalization to make on behalf of every North American
location in the dataset. But for our purposes and in general, it is a good understanding
of our dataset to have. For our last visualization,
let’s plot lightning strikes by quarter. For our visualization,
it will be far easier to work with numbers in millions
such as 25.2 million rather than 25,154,365, for example. Let’s create a column that divides
the total number of strikes by one million. We do this by typing df_by_quarter, and entering the relevant column
in the arguments field. In this case, we want number of strikes. Next, we add on .div to
get our division function. Lastly for the argument field,
we enter 1000000. When we run this cell, we have a column that provides the number
of lightning strikes in millions. Next, we’ll group the number
of strikes by quarter using the groupby and
reset_index functions. This code divides the number of strikes
into quarters for all three years. Each number is rounded
to the first decimal. The letter m represents one million. As you’ll soon discover, this calculation
will help with the visualization. You’ll learn more about these
functions in another video. We will plot our chart using
the same format as before. We use the plt.bar with
our x being from our df_by_quarter dataframe,
with quarter in the argument field. For the height, we put the number_of_strikes
column in the argument field. It would be helpful if each
quarter had the total lightning strike count at the top of each bar. To do that, we need to define our own
function, which we will call, addlabels. Let’s type addlabels,
then input our two column axes, quarter, and number of strikes
separated by commas and brackets. At the end,
we use the format we created earlier, number_of_strikes formatted to label
the number_of_strikes_by_quarter. To finish the bar chart, we label
the x and y-axis and add the title. Before we show the data visualization, there are a few small things we want to
add just to make it more friendly to read. Let’s set our length and
height to 15 by 5. Next, let’s make the bar
labels cleaner by defining those numbers and centering the text. Our bar chart now gives us the number of
strikes by quarter from 2016 to 2018. To make the information easier to digest,
let’s do one more visualization. Here is the code for
plotting a bar chart that groups the total number of strikes
year over year by quarter. Review the code carefully and
consider what each function an argument does in order to create this
final polished bar chart. Each year has assigned its own color to
highlight the differences in quarters. And now we have our chart. Coming up, you’ll learn more about the
different methods for structuring data. I’ll see you there.

Practice Quiz: Test your knowledge: Understand data format

Which of the following statements will convert the ‘time’ column into a datetime data type?

What Python method formats data into a new string representing date and time using a date, time, or datetime object?

A data professional is creating a bar chart in Python. To label the y-axis Sales to Date, a data professional could use the following statement: plt.ylabel(‘Sales to Date’).

Create structure from raw data


Video: Use structuring methods to establish order in your dataset

This text explains the importance of structuring data in data analytics for exploring and understanding it better. It then breaks down six key methods for structuring data:

  1. Sorting: Arranging data in a meaningful order (e.g., ascending or descending pouch size).
  2. Extraction: Retrieving specific data from a source (e.g., pouch volume and tail length from kangaroo dataset).
  3. Filtering: Selecting a subset of data based on conditions (e.g., kangaroos with tails shorter than 1 meter).
  4. Slicing: Taking subsets of rows and columns for focused analysis (e.g., body length of one regional population).
  5. Grouping: Aggregating individual observations into categories (e.g., grouping tail lengths as long, average, short).
  6. Merging: Combining data from multiple sources aligned by specific columns (e.g., joining two kangaroo datasets).

The text emphasizes the importance of maintaining data integrity throughout these operations and warns against altering the meaning of the data. Finally, it promises to demonstrate these methods in practice using Python in the next section.

Here’s a tutorial on using structuring methods to establish order in your dataset:

Introduction

  • Welcome to this tutorial on structuring methods for dataset organization!
  • Structuring is crucial for unlocking insights and patterns within raw data.
  • It involves techniques to arrange, group, filter, and combine data effectively.
  • By mastering these methods, you’ll enhance your ability to analyze and extract meaningful information from your datasets.

Key Structuring Methods

  1. Sorting:
    • Arrange data in a specific order (ascending, descending, alphabetical, etc.)
    • Example: Sort sales data by revenue to identify top-performing products.
  2. Extracting:
    • Retrieve specific parts of a dataset for focused analysis.
    • Example: Extract customer email addresses for a targeted marketing campaign.
  3. Filtering:
    • Create a subset of data based on certain criteria.
    • Example: Filter product reviews to view only those with 5-star ratings.
  4. Slicing:
    • Select specific rows and columns for analysis.
    • Example: Analyze sales data for a particular region or product category.
  5. Grouping:
    • Aggregate data into categories based on shared characteristics.
    • Example: Group customer data by age range to understand demographic trends.
  6. Merging:
    • Combine datasets from different sources, often using common identifiers.
    • Example: Merge customer data with sales data to create comprehensive profiles.

Demo in Python:

  • Let’s demonstrate these methods using Python’s pandas library:

Python

import pandas as pd

# Sample dataset
data = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 28]})

# Sorting by age (ascending)
data_sorted = data.sort_values('age')

# Extracting names
names = data['name']

# Filtering for ages above 27
data_filtered = data[data['age'] > 27]

# Slicing first two rows
data_sliced = data.iloc[:2]

# Grouping by age (apply an aggregation such as .count() to view the grouped result)
data_grouped = data.groupby('age')['name'].count()

# Merging with a second, hypothetical dataset that shares the 'name' column
other_data = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                           'city': ['Austin', 'Boston', 'Chicago']})
merged_data = data.merge(other_data, on='name')

Best Practices

  • Choose structuring methods that align with your analysis goals.
  • Consider data types and structures when applying techniques.
  • Use visual aids (charts, graphs) to complement structuring efforts.
  • Maintain data integrity throughout the structuring process.
  • Document your steps for reproducibility and clarity.

Conclusion

  • Structuring is an essential skill for data analysis and exploration.
  • By effectively organizing your datasets, you’ll uncover valuable patterns and insights.
  • Practice these methods regularly to enhance your data analysis toolkit!

Additional Tips

  • Experiment with different combinations of structuring methods for tailored analysis.
  • Utilize tools and libraries (like pandas) to streamline the structuring process.
  • Engage in collaborative data analysis to share knowledge and expertise.

What structuring method enables data professionals to divide information into smaller parts in order to facilitate efficient examination and analysis from different viewpoints?

Slicing

Slicing enables data professionals to break down information into smaller parts.

As a data analytics
professional, you will often need to learn
more about your datasets. This is where the structuring
practice of EDA can help. Let’s explore the
valuable methods that are part of this practice. As you’ll recall from
earlier in the program, structuring helps
you to organize, gather, separate, group, and filter your data in different ways to
learn more about it. Next, we’ll talk about the methods involved in
structuring and later, you’ll practice these
concepts using Python. First on the list of
structuring methods is sorting. Sorting is the process of arranging data into
meaningful order. Imagine that you are given
a dataset about kangaroos. These furry creatures are native to Australia and Papua New Guinea. They are known for their strong tails and the belly pouch they use to cradle their babies, called joeys. The kangaroo dataset
contains information about kangaroo characteristics
like pouch size, tail length, total body length, and much, much more. The first data we’ll consider
is a data column measuring the volume of the kangaroo
pouches in cubic centimeters. We can sort those values
in ascending or descending order from biggest to smallest
or smallest to biggest. Another useful structuring
tool is extraction. Extracting is the
process of retrieving data from a dataset or source for further processing. You can think of extraction as retrieving whole
columns of data. An example of extraction is to take the kangaroo
data from before, then evaluate just two
of the columns from the dataset such as pouch
volume and tail length. You can use the
resulting data for analysis, comparisons,
or visualization. Another structuring
method is filtering. Filtering is the process of selecting a smaller
part of your dataset based on specified parameters and using it for
viewing or analysis. You can think of filtering as selecting rows of a dataset. In the case of our
kangaroo dataset, filtering can look like viewing only the
kangaroo pouches of kangaroos that also have
tails shorter than one meter. This is useful in finding meaningful groups or
trends in the data. Next on the list of structuring
methods is slicing. Slicing breaks information
down into smaller parts to facilitate efficient
examination and analysis from
different viewpoints. Think of slicing as applying either or both options for columns and rows, a combination of extraction and filtering. In the kangaroo dataset, let’s say you have a column of their body length called
total body length. In another column, you have
the kangaroos identified as one of the three different
regional populations. If you were to take
the body length of only one of the three
regional populations, you would be pulling
a slice of the data. Grouping is our next
structuring method. Grouping sometimes
called bucketizing, is aggregating
individual observations of a variable into groups. An example of grouping is
to add a new column called Total Body length next to the kangaroo tail length column. Then group all the tail
lengths into three types, long, average, and short based on the measurements
in the tail length column. You can now find and organize the total body length values based on the kangaroo
tail length groups. The last structuring
method is merging. Merging is a method to combine two different data frames along a specified
starting column. For example, imagine we had an additional dataset of kangaroo information from
a different field study, but with the same
parameters and variables. We might use the merge
or join functions to align the columns and combine the new data
into one data set. It’s essential that
you do not change the meaning of the data while
performing your filtering, sorting, slicing, joining,
and merging operations. If for example, we did not merge the kangaroo
pouch measurements correctly with their matching
kangaroo name and ID, the data would not
be representative and our analysis would
be far less useful. Being true to the data is
being true to its story. Hopefully, you are beginning
to understand the value of organizing and structuring
data in order to analyze it. Coming up, we will practice
structuring using Python.
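
Before moving to the notebook, here is a minimal pandas sketch of all six structuring methods. The kangaroo DataFrame, its column names, and the second field-study table are hypothetical and invented purely for illustration; they are not the course dataset.

Python

import pandas as pd

# Hypothetical kangaroo dataset for illustration only
kangaroos = pd.DataFrame({
    'kangaroo_id': [1, 2, 3, 4],
    'pouch_volume_cc': [410, 385, 455, 300],
    'tail_length_m': [1.05, 0.92, 1.10, 0.85],
    'total_body_length_m': [1.6, 1.4, 1.7, 1.3],
    'region': ['East', 'West', 'East', 'North']
})

# Sorting: arrange pouch volumes in descending order
by_pouch = kangaroos.sort_values('pouch_volume_cc', ascending=False)

# Extracting: retrieve whole columns for further analysis
pouch_and_tail = kangaroos[['pouch_volume_cc', 'tail_length_m']]

# Filtering: keep only rows where the tail is shorter than one meter
short_tails = kangaroos[kangaroos['tail_length_m'] < 1]

# Slicing: body length for a single regional population (rows and columns together)
east_lengths = kangaroos.loc[kangaroos['region'] == 'East', 'total_body_length_m']

# Grouping: bucket tail lengths into categories, then summarize body length per group
kangaroos['tail_group'] = pd.cut(
    kangaroos['tail_length_m'],
    bins=[0, 0.9, 1.05, float('inf')],
    labels=['short', 'average', 'long']
)
mean_length_by_group = kangaroos.groupby('tail_group', observed=True)['total_body_length_m'].mean()

# Merging: combine a second (hypothetical) field study aligned on kangaroo_id
field_study_2 = pd.DataFrame({
    'kangaroo_id': [1, 2, 3, 4],
    'weight_kg': [55, 48, 60, 42]
})
merged = kangaroos.merge(field_study_2, on='kangaroo_id')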

Reading: Reference guide: Pandas tools for structuring a dataset

Lab: Annotated follow-along guide: EDA structuring with Python

Video: EDA structuring with Python

Structuring Data in Python for Exploratory Data Analysis (EDA)

1. Understanding Data Shape and Cleaning:

  • Import necessary libraries: Pandas, NumPy, Seaborn, datetime, matplotlib.pyplot.
  • Check data shape using df.shape.
  • Identify and remove duplicates using df.drop_duplicates().

2. Sorting and Value Counts:

  • Sort data by column values using df.sort_values().
  • Count occurrences of values in a column using df['column_name'].value_counts().

3. Grouping Data:

  • Create new columns for analysis (e.g., week number, weekday).
  • Group data by specific columns using df.groupby('column_name').

4. Merging DataFrames:

  • Combine multiple datasets using pd.concat([df1, df2]).
  • Ensure consistent formatting before merging.

5. Structuring for Percentage Calculations:

  • Create new columns for year, month, and month_text.
  • Group and aggregate data using .groupby() and NamedAgg.
  • Merge DataFrames to calculate percentages.

6. Data Visualization:

  • Create box plots using Seaborn’s sns.boxplot() to visualize distributions.
  • Create grouped bar charts using sns.barplot() to compare groups.

Key Insights from the Lightning Strike Data:

  • Weekends have lower reported lightning strikes than weekdays.
  • August 2018 had a significantly higher percentage of lightning strikes.
  • Further research into storm and hurricane data could provide context.

Remember: Structuring is essential for uncovering patterns and stories within your data. Practice these techniques to enhance your EDA skills and gain valuable insights.
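
Here is a minimal sketch of steps 1–3 above. The tiny DataFrame and the column names (date, number_of_strikes, center_point_geom) are stand-ins assumed for illustration; the actual NOAA notebook data is much larger.

Python

import pandas as pd

# Tiny synthetic stand-in for the lightning strike data (column names are assumptions)
df = pd.DataFrame({
    'date': ['2018-08-01', '2018-08-01', '2018-08-04'],
    'number_of_strikes': [1520, 210, 45],
    'center_point_geom': ['POINT(-81.5 22.5)', 'POINT(-80.1 25.6)', 'POINT(-81.5 22.5)']
})
df['date'] = pd.to_datetime(df['date'])

# 1. Shape and duplicates
print(df.shape)
print(df.drop_duplicates().shape)   # same shape means no duplicate rows

# 2. Sorting and value counts
top_days = df.sort_values(by='number_of_strikes', ascending=False).head(10)
location_counts = df['center_point_geom'].value_counts()

# 3. Grouping: add week and weekday columns, then compare mean strikes per weekday
df['week'] = df['date'].dt.isocalendar().week
df['weekday'] = df['date'].dt.day_name()
mean_by_weekday = df[['weekday', 'number_of_strikes']].groupby('weekday').mean()
print(mean_by_weekday)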

Introduction:

  • Briefly explain what Exploratory Data Analysis (EDA) is and its importance in data analysis.
  • Define data structuring and its role in EDA.
  • Highlight the benefits of using Python for EDA structuring.

Libraries and Tools:

  • Introduce the essential Python libraries for EDA: Pandas, NumPy, Seaborn, Matplotlib.
  • Explain the purpose of each library and its key functions.

Data Loading and Inspection:

  • Demonstrate how to load data into a Pandas DataFrame using pd.read_csv() or other appropriate methods.
  • Explore the DataFrame’s structure using df.shape, df.head(), df.info(), and df.describe().
  • Check for missing values using df.isnull().sum() and handle them as needed.
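
A minimal sketch of these inspection steps; the file name 'my_data.csv' is a placeholder, not an actual course file.

Python

import pandas as pd

df = pd.read_csv('my_data.csv')   # placeholder file name

print(df.shape)            # (number of rows, number of columns)
print(df.head())           # first five rows
df.info()                  # column names, dtypes, non-null counts
print(df.describe())       # summary statistics for numeric columns
print(df.isnull().sum())   # missing values per column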

Data Cleaning:

  • Address any inconsistencies or errors in the data.
  • Handle missing values using techniques like imputation or removal.
  • Identify and remove duplicate entries using df.drop_duplicates().
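
A minimal cleaning sketch using a small hypothetical DataFrame; mean imputation is shown only as one possible approach.

Python

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['East', 'East', 'West', None],
    'sales': [120.0, 120.0, np.nan, 95.0]
})

# Impute missing numeric values with the column mean
df['sales'] = df['sales'].fillna(df['sales'].mean())

# Or drop rows that still contain missing values
df = df.dropna()

# Remove duplicate rows
df = df.drop_duplicates()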

Structuring Techniques:

  • Sorting: Demonstrate how to sort data by column values using df.sort_values().
  • Grouping: Show how to group data based on specific columns using df.groupby() and apply aggregate functions.
  • Merging: Explain how to combine multiple datasets using pd.concat() or df.merge().
  • Creating New Columns: Demonstrate how to create new columns based on existing data or calculations.
  • Reshaping Data: Introduce techniques like pivoting and melting for restructuring data.
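
Most of these techniques are demonstrated elsewhere in this module; reshaping is not, so here is a minimal pivot/melt sketch with a hypothetical sales table.

Python

import pandas as pd

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 120, 90, 110]
})

# Pivoting: one row per region, one column per quarter
wide = sales.pivot_table(index='region', columns='quarter', values='revenue')

# Melting: back to long format with one observation per row
long = wide.reset_index().melt(id_vars='region', var_name='quarter', value_name='revenue')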

Applying Structuring for Insights:

  • Provide examples of how structuring can reveal patterns and relationships within data.
  • Demonstrate how to visualize distributions using box plots or histograms.
  • Show how to compare groups using bar charts or line plots.
  • Explore correlations between variables using scatter plots or heatmaps.
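
A minimal sketch of these plot types with Seaborn and Matplotlib, using a small hypothetical DataFrame.

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B', 'A'],
    'value': [10, 12, 7, 9, 8, 11],
    'score': [1.0, 1.4, 0.6, 0.9, 0.7, 1.2]
})

sns.histplot(data=df, x='value')            # distribution of a single variable
plt.show()

sns.boxplot(data=df, x='group', y='value')  # compare distributions across groups
plt.show()

sns.barplot(data=df, x='group', y='value')  # compare group means
plt.show()

sns.heatmap(df[['value', 'score']].corr(), annot=True)  # correlations between variables
plt.show()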

Key Points to Remember:

  • Emphasize the importance of choosing appropriate structuring techniques based on the research question and data characteristics.
  • Encourage learners to practice structuring on different datasets to develop their skills.
  • Highlight the benefits of combining structuring with other EDA techniques for a comprehensive understanding of data.

Additional Resources:

  • Suggest online courses, tutorials, or documentation for further learning.
  • Provide links to relevant libraries and documentation.

Conclusion:

  • Summarize the key takeaways of the tutorial.
  • Encourage learners to apply their EDA structuring skills to real-world datasets and projects.

You’ve learned about
how structuring data can help professionals analyze, understand, and learn
more about their data. Now, let’s use a
python notebook, and discover how it
works in practice. We’ll continue using our NOAA
lightning strike dataset. For this video, we’ll
consider the data for 2018, and use our structuring tools
to learn more about whether lightning strikes are more prevalent on some
days than others. Before we do anything else, let’s import our Python
packages, and libraries. These are all packages
and libraries you’re familiar with, Pandas, NumPy, Seaborn, datetime,
matplotlib.pyplot. For a quick refresher, let’s convert our date
column to datetime, and take a look at
our column headers. We do this to get
our dates ready for any future string manipulation
we may want to do, and to remind us of
what is in our data. As you remember, there are three columns in
the dataset: date, number of strikes, and
center point geom, which you’ll find after
running the head function. Next, let’s learn about
the shape of our data by using the df.shape attribute. When we run this cell, we get (3,401,012, 3). Take a moment to picture
the shape of this dataset. We’re talking about only
three columns wide and nearly 3.5 million
rows long. That’s incredibly long and thin. We’ll use a function for
finding any duplicates. When we enter df.drop_duplicates() with an empty argument field, followed by .shape, the notebook will return
the number of rows, and columns remaining after
duplicates are removed. Because this returns
the same exact number as our shape function, we know there are no
duplicate values. Let’s discuss some of those
structuring concepts we learned about earlier.
Let’s start with sorting. We’ll sort the number
of strikes column in descending value
or most to least. While we do this, let’s consider the dates with the highest number of strikes. We’ll input df.sort_values(). Then in the argument
field type by, then the equals sign. Next, we input the
column we want to sort, number of strikes followed by ascending equals sign false. If we add the head
function to the end, the notebook outputs the top
10 cells for us to analyze. We find that the
highest number of strikes are in the lower 2000s. It does seem like a lot of lightning strikes
in just one day, but given that it happened in August when storms are likely, it is probable these
2000 plus strikes were counted during a storm. Next, let’s look
at the number of strikes based on the
geographic coordinates, latitude, and longitude. We can do this by using
the value_counts function. We type in df, followed by the
center point geom. Then we type in .value_counts() with
an empty argument field. Based on our result, we learned that
the locations with the most strikes have
lightning on average, every one in three days, with numbers in the low 100s. Meanwhile, some
locations are reporting only one lightning strike
for the entire year of 2018. We also want to learn if we have an even distribution of values, or whether 108 is a notably high value for
lightning strikes in the US. To do this, copy the same
value counts function, but input a colon, 20 in the brackets so that you can see the first 20 lines. The rest of the
coding here is to help present the data clearly. We rename the axis, and index to unique values,
and counts respectively. Lastly, we’ll add a gradient background to the counts column
for visual effect. After running the cell, we discover no notable drops in lightning strike counts among the top 20 locations. This suggests that there are no notably high lightning strike data points, and that the data values
are evenly distributed. Next, let’s use another
structuring method, grouping. You’ll often find stories hidden among different
groups in your data. Like the most profitable times of day for a retail
store, for instance. For this dataset,
one useful grouping is categorizing lightning
strikes by day of week which will tell us whether
any particular day has fewer or more lightning
strikes than others. Let’s first create
some new data columns. We create a column called week by inputting
df.date.dt.isocalendar. Let’s leave the argument
field blank and add a .week at the end. This will create a
column assigning numbers 1-52 for each of the
days in the year 2018. Let’s also add a column
that names the weekday. Type in df.date.dt.day_name, leaving the argument
field blank. For this last part,
let’s input df.head. Again, you’ll discover
the dates now have week numbers and
assigned weekdays. We have some new columns, so let’s group the number of strikes by weekday to determine whether any particular
day of week has more lightning
strikes than others. Let’s create a DataFrame with just the weekday and number
of lightning strikes. We’ll do this by inputting df, double bracket, weekday, comma, number of strikes both
in single quotes, followed by more
double brackets. Next, we’ll add one of our structuring
functions, groupby, followed by weekday dot mean
within the argument field. What we’re telling the
notebook here is to create a DataFrame with weekday
and number of strikes, but then also group the total number of strikes
by day of the week, giving us the mean number
of strikes for that day. To understand what this
data is telling us, let’s plot a box plot chart. A boxplot is a
data visualization that depicts the locality, spread, and skew of groups
of values within quartiles. For this dataset and notebook, a box plot visualization will be the most
helpful because it will tell us a lot about the distribution of
lightning strike values. Most of the lightning
strike values will be shown as grouped
into colored boxes, which is why this visualization
is called a box plot. The rest of the values
will string out to either side with a
straight line that ends in a T. We will discuss more about box
plots in an upcoming video. Now before we plot, let’s set the weekday order
to start with Monday. Now to code that, input g, equal sign, sns.boxplot. Next, in the argument field, let’s have x equal weekday and
y equal number of strikes. For order, let’s
do weekday order, and for the showfliers
field, let’s input False. Showfliers refers to outliers that may or may not be
included in the box plot. If you input True, outliers are included. If you input False, outliers are left off
the box plot chart. Keep in mind, we aren’t deleting any outliers from the dataset
when we create this chart, we’re only excluding them
from our visualization to get a good sense of the
distribution of strikes across the
days of the week. Lastly, we will plug in
our visualization title, lightning distribution
per weekday for 2018 and click run cell. Now you’ll discover something
really interesting. The median, indicated by these horizontal black
lines remains the same on all of the
days of the week. As for Saturday and
Sunday, however, the distributions are both lower than the rest of the week. Let’s consider why that is. What do you think
is more likely? That lightning strikes across the United States
take a break on the weekends or that people do not report as many lightning
strikes on weekends? While we don’t know for sure, we have clear data suggesting the total quantity of weekend lightning strikes
is lower than weekdays. We’ve also learned a story
about our dataset that we didn’t know before we tried
grouping it in this way. Let’s get back into
our notebook and learn some more about
our lightning data. One common structuring
method we learned about in another
video was merging, which you’ll remember means combining two different
data sources into one. We’ll need to know
how to perform this method in Python if we want to learn more about our
data across multiple years. Let’s add two more years to
our data, 2016 and 2017. To merge three years
of data together, we need to make sure each
dataset is formatted the same. The new datasets do not have the extra columns week and weekday that we created earlier. To merge them successfully, we need to either remove the new columns or add
them to the new datasets. There’s an easy way to
merge the three years of data and remove the extra
columns at the same time. Let’s call our new
data frame union_df. We’ll use the pandas
function concat to merge or more accurately concatenate
the three years of data. Inside the concat argument
field we’ll type in df.drop to pull the weekday
and week columns out. We also input the axis we want to drop,
which is one. Lastly, and most essentially, we add the data
frame name we are concatenating to, df_2. We also input True for ignore_index because the two data
frames will already align along their first columns and now you’ve just learned to
merge three years of data. To help us with the next
part of structuring, create three date columns following the same steps
you used previously. We’ve already added
the columns for year, month, and month_text
to the code. Now let’s add all the
lightning strikes together by year so
we can compare them. We can do this by simply taking the two columns
we want to look at, year and number of
strikes and group them by year with the
function .sum() on the end. You’ll find that 2017 did have fewer total strikes than 2016 and 2018. Because the totals
are different, it might be interesting as
part of our analysis to see lightning strike percentages
by month of each year. Let’s call this lightning
by month grouping our union data frame by
month_text and year. Additionally, let’s
aggregate the number of strikes column by using the
pandas function NamedAgg. In the argument field, we place our column name and our aggregate function equal to sum, so that we get the totals for each of the months
in all three years. When we input the head function, we have the months in
alphabetical order, along with the sums
of each month. We can do the same
aggregation for year and year strikes to review those same numbers
we saw before with 2017 having fewer strikes
than the two other years. We created those
two data frames, lightning by month and
lightning by year in order to derive our percentages of lightning strikes
by month and year. We can get those percentages
by typing lightning by month.merge with
lightning by year, on equal sign year in
the argument field. You’ll find that
the merge function is merging lightning by year into our
lightning by month data frame according
to the year. Lastly, we can create a percentage lightning per month column by dividing the monthly number of strikes by the yearly number of strikes, after which we’ll add the asterisk 100 to give us a percentage. Now, when we use
our head function, we have a restructured
data frame. To more easily review our
percentages by month, let’s plot a data visualization. For this one, a simple grouped
bar graph will work well. We’ll adjust our figure
size to 10 and 6 first. Then we use the Seaborn library bar plot with our x-axis as month_text and our y-axis as percentage
lightning per month. For some color, we’ll have
our hue change according to the year column with the data following the month
order column. Finally, let’s input our x and y labels and our title
and run the cell. When you analyze the bar chart, August 2018 really stands out. In fact, more than one-third of the lightning strikes for 2018 occurred in
August of that year. The next step for a data professional trying
to understand these findings might be to research storm and
hurricane data, to learn whether those
factors contributed to a greater number of lightning strikes for
this particular month. Now that you’ve learned some of the Python code for the EDA
practice of structuring, you’ll have time to
try them out yourself. Good luck finding those
stories about your data.
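
Here is a minimal sketch of the merging and percentage steps described above. The DataFrame and column names (union_df, number_of_strikes, month_text, year_strikes, and so on) follow the narration but are reconstructed here with tiny synthetic data, so treat them as assumptions rather than the exact notebook code.

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tiny synthetic stand-ins for the yearly DataFrames (the real data has millions of rows)
df = pd.DataFrame({'date': pd.to_datetime(['2018-08-01', '2018-08-02']),
                   'number_of_strikes': [1520, 640]})
df_2 = pd.DataFrame({'date': pd.to_datetime(['2016-08-01', '2017-08-01']),
                     'number_of_strikes': [900, 700]})

# Concatenate the years into one DataFrame
union_df = pd.concat([df, df_2], ignore_index=True)

# Date columns for grouping
union_df['year'] = union_df['date'].dt.year
union_df['month_text'] = union_df['date'].dt.month_name()

# Total strikes per year, and per month within each year (NamedAgg names the output column)
lightning_by_year = union_df.groupby('year').agg(
    year_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum')).reset_index()
lightning_by_month = union_df.groupby(['month_text', 'year']).agg(
    number_of_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum')).reset_index()

# Merge on year, then compute each month's share of its year's strikes
percentage_lightning = lightning_by_month.merge(lightning_by_year, on='year')
percentage_lightning['percentage_lightning_per_month'] = (
    percentage_lightning['number_of_strikes'] / percentage_lightning['year_strikes'] * 100)

# Grouped bar chart: one bar per month, colored by year
plt.figure(figsize=(10, 6))
sns.barplot(data=percentage_lightning, x='month_text',
            y='percentage_lightning_per_month', hue='year')
plt.xlabel('Month')
plt.ylabel('% of yearly strikes')
plt.title('% of lightning strikes each month (2016-2018)')
plt.show()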

Fill in the blank: A box plot is a data visualization that depicts the locality, spread, and _____ of groups of values within quartiles.

skew

A box plot is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles. Box plots provide information on the variability and dispersion of data by depicting how the values in the data are spread out.

Reading: Histograms

Lab: Exemplar: Structure your data

Practice Quiz: Test your knowledge: Create structure from raw data

Fill in the blank: Grouping is a structuring method that enables data professionals to _____ individual observations of a variable into different categories or classes.

Which of the following Python statements will create a list called grade_order that starts with Preschool?
grade_order = [‘Preschool’, ‘Kindergarten’, ‘Elementary School’, ‘Middle School’, ‘High School’]

order_grade = [‘Preschool’, ‘Kindergarten’, ‘Elementary School’, ‘Middle School’, ‘High School’]

grade_order (‘Preschool’, ‘Kindergarten’, ‘Elementary School’, ‘Middle School’, ‘High School’)

order = [‘Preschool_Grade’, ‘Kindergarten_Grade’, ‘Elementary School_Grade’, ‘Middle School_Grade’, ‘High School_Grade’]

A data professional can use the concat function to join two or more dataframes.

Review: Explore raw data


Video: Wrap-up

  • The content on this page discusses the process of building a puzzle and how it relates to data analysis.
  • It highlights the progress you’ve made in the course so far, including learning about data sources, types, and formats, as well as basic visualizations.
  • The importance of cleaning and organizing data is emphasized, as it is a crucial step in uncovering insights and trends.
  • The page also mentions the challenges of merging data from multiple sources and tables, using a real-life example from healthcare consulting.
  • Workplace skills such as communication, hypothesis testing, and storytelling with data are also covered.
  • The page concludes by mentioning upcoming topics, including dealing with missing data and outliers, and using visualizations to tell a data story.

You know that moment in
the puzzle building process, when you’re starting to see
the full picture take shape? Like, maybe you have the frame pieces connected,
but still have a way to go. So far you’ve learned about
some practices of EDA. You’ve learned how to gather,
analyze, organize and structure data. Coming up, you will continue to
put these pieces together and the picture will become clear. You’re making great progress. You’ve learned about the data sources,
data types and data formats and the importance of
knowing the basics about your data. We’ve worked through how to use
Python to uncover big picture understandings of your datasets
like column headers, dtypes, size and
shapes as well as basic visualizations. Along with these parts of EDA and
discovering, we’ve learned about date and time transformations in Python. As for the EDA practice of structuring, you’ve learned to make order from
chaos with functions like sorting, extracting, filtering, slicing,
joining, merging and grouping. In Python, you practiced applying these
functions to datasets that are similar to those you might work with in your
career as a data professional. Data professionals commonly
use each of these concepts. You will continue to build your
skills with them as you search for the stories hidden in data. As a data professional I
can say that cleaning and organizing your dataset is 90% of
the battle, and once you’ve structured your tables, uncovering insights and
trends can be a walk in the park. I mentioned earlier how I used to work as
a data analyst in healthcare consulting and would analyze vast amounts
of medical record data to help recommend treatments for
patients with severe illnesses. Typically data is hosted across
multiple different sources and tables, making them difficult to merge together. Plus medical records may be organized so
that medications a patient took and their corresponding
conditions are separated. So in order for me to understand what
type of treatment a patient took for which illness,
the side effects they experienced and whether the treatment worked I had to
merge hundreds of tables together. Once I did, it was incredibly easy for
me to compare and contrast the different types of treatments
taken and their impact on each patient. You’ve started to learn the skills needed to complete similar tasks. Along the way, you also learned some key
workplace skills like the timing for communicating updates and
posing questions to project stakeholders, managers and subject matter experts. We also talked about making and
testing hypotheses on your datasets, which narrows the scope and sharpens
the detail of your data driven stories. In short, you are understanding more and more what it means to
perform EDA on a dataset. Great work. Later in the program we’ll
discuss how to put together the rest of the data story by learning what to do
with missing data and outliers as well as making and using visualizations to help
tell that story. I’ll meet you there.

Reading: Glossary terms from module 2

Terms and definitions from Course 3, Module 2

Quiz: Module 2 challenge

What are some strategies data professionals use to understand the source of a dataset? Select all that apply.

What are some of the benefits of JSON files for data professionals? Select all that apply.

What type of data is gathered outside of an organization and aggregated?

Which of the following statements correctly uses the head() function to return the first 5 rows of a dataset?

Which of the following statements will assign the name Kuwait Museums to a bar graph in Python?

Fill in the blank: The Python function fig.show() is used to render a _____ of a plot.

Which structuring method selects a smaller part of a dataset based on specified parameters, then uses it for analysis?

Fill in the blank: A box plot is a data visualization that depicts the locality, skew, and _____ of groups of values within quartiles.