Week 1: Data types and structures

We all generate lots of data in our daily lives. In this part of the course, you’ll check out how we generate data and how analysts decide which data to collect for analysis. You’ll also learn about structured and unstructured data, data types, and data formats as you start thinking about how to prepare your data for exploration.

Learning Objectives

  • Explain how data is generated as a part of our daily activities with reference to the types of data generated
  • Explain factors that should be considered when making decisions about data collection
  • Explain the difference between structured and unstructured data
  • Discuss the difference between data and data types
  • Explain the relationship between data types, fields, and values
  • Discuss wide and long data formats with references to organization and purpose

Data exploration


Video: Introduction to data exploration

This course will teach you how to prepare data for analysis, a crucial step in the data analysis process. You will learn to identify how data is generated and collected, and you will explore different formats, types, and structures of data. You will also learn how to choose and use data that will help you understand and respond to a business problem, and how to analyze data for bias and credibility. Additionally, you will learn what clean data means and how to extract data from a database using spreadsheets and SQL. Finally, you will learn the basics of data organization and data protection.

Here is a more detailed overview of each topic:

  • Identifying how data is generated and collected: You will learn about the different ways that data is generated and collected, such as through surveys, sensors, and social media. You will also learn how data is classified by structure: structured, unstructured, and semi-structured.
  • Exploring different formats, types, and structures of data: You will learn about the different formats that data can be stored in, such as CSV, JSON, and XML. You will also learn about the different types of data, such as numerical, categorical, and text data. Finally, you will learn about the different structures that data can have, such as tables, hierarchies, and graphs.
  • Choosing and using data to understand and respond to a business problem: You will learn how to choose the right data for your analysis based on the business problem that you are trying to solve. You will also learn how to use data to answer your questions and to make recommendations.
  • Analyzing data for bias and credibility: You will learn how to identify and mitigate bias in data. You will also learn how to assess the credibility of data.
  • Understanding clean data: You will learn what clean data is and why it is important. You will also learn how to clean data.
  • Extracting data from a database using spreadsheets and SQL: You will learn how to use spreadsheets and SQL to extract data from a database.
  • Learning the basics of data organization and data protection: You will learn how to organize data in a way that is efficient and easy to use. You will also learn how to protect data from unauthorized access and corruption.

By the end of this course, you will have the skills necessary to prepare data for analysis in a variety of settings. You will be able to choose and use the right data for your analysis, clean the data, and extract data from a database. You will also be able to organize and protect your data.

Introduction to Data Exploration

Data exploration is the process of analyzing data to discover patterns, trends, and insights. It is an essential step in any data science project, as it allows you to understand your data and identify the most important questions to ask.

There are a number of different techniques that can be used for data exploration, but some of the most common include:

  • Visualizations: Visualizations are a great way to get a quick overview of your data and to identify any obvious patterns or trends. Some common visualizations include histograms, bar charts, line charts, and scatter plots.
  • Summary statistics: Summary statistics can be used to get a more detailed understanding of your data. Some common summary statistics include the mean, median, mode, and standard deviation.
  • Grouping and aggregation: Grouping and aggregation can be used to identify patterns in your data that are not immediately obvious. For example, you could group your data by customer age and then calculate the average purchase amount for each age group.
  • Anomaly detection: Anomalies are data points that differ significantly from the rest of the data. Finding them can help reveal fraud, data-entry errors, or other problems.
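
The techniques listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a prescribed method: the purchase records are made up, the grouping by age decade is just one way to form age groups, and the "more than two standard deviations from the mean" rule is a common rule of thumb chosen here as an assumption.

```python
from collections import defaultdict
from statistics import mean, median, mode, stdev

# Hypothetical purchase records: (customer_age, purchase_amount)
purchases = [
    (23, 40.0), (27, 55.0), (31, 60.0), (35, 60.0),
    (42, 80.0), (47, 90.0), (52, 75.0), (58, 900.0),
]

amounts = [amount for _, amount in purchases]

# Summary statistics for the purchase amounts
summary = {
    "mean": mean(amounts),
    "median": median(amounts),
    "mode": mode(amounts),
    "stdev": stdev(amounts),  # sample standard deviation
}

# Grouping and aggregation: average purchase amount per age decade
groups = defaultdict(list)
for age, amount in purchases:
    decade = (age // 10) * 10  # 23 -> 20, 35 -> 30, and so on
    groups[decade].append(amount)
avg_by_group = {decade: mean(vals) for decade, vals in sorted(groups.items())}

# Anomalies: values more than 2 standard deviations from the mean
# (the threshold of 2 is an assumption, not a fixed rule)
anomalies = [
    a for a in amounts
    if abs(a - summary["mean"]) > 2 * summary["stdev"]
]

print(summary)
print(avg_by_group)
print(anomalies)
```

Note how the single very large purchase pulls the mean far above the median. Comparing the two is a quick way to spot skewed data before charting it.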

Here is a step-by-step guide to data exploration:

  1. Gather your data: The first step is to gather the data that you want to explore. This data could come from a variety of sources, such as a database, a spreadsheet, or a log file.
  2. Clean your data: Once you have your data, you need to clean it. This means removing any errors or inconsistencies in the data.
  3. Explore your data: Once your data is clean, you can start to explore it. This can be done using a variety of techniques, such as visualizations, summary statistics, grouping and aggregation, and anomaly detection.
  4. Identify patterns and trends: As you explore your data, look for patterns and trends. These patterns and trends can help you to understand your data and to identify the most important questions to ask.
  5. Formulate hypotheses: Once you have identified some patterns and trends, you can start to formulate hypotheses. Hypotheses are statements about your data that you can test using further analysis.

Data exploration is an iterative process. As you learn more about your data, you may need to go back and explore it again using different techniques.

Here are some tips for data exploration:

  • Start with a clear goal: What do you want to learn from your data? Once you know your goal, you can focus your exploration on the most relevant data and techniques.
  • Be creative: There are many different ways to explore data. Don’t be afraid to experiment with different techniques and visualizations.
  • Use a variety of tools: There are a number of different tools available for data exploration. Use the tools that are most comfortable for you and that are best suited for your data.
  • Collaborate with others: Data exploration can be a challenging task. Don’t be afraid to collaborate with others, such as data scientists or analysts.

Data exploration is an essential skill for any data scientist or data analyst. By following the tips above, you can learn how to explore your data effectively and to discover valuable insights.

Picture this: You’re working on a project. You’ve asked all the right questions, applied structured thinking, and you’re completely in sync with your stakeholders. You’re off to a great start. But there’s another step in the process: preparing the data correctly. This is where understanding the different types of data and data structures comes in. Knowing this lets you figure out what type of data is right for the question you’re answering. Plus, you’ll gain practical skills about how to extract, use, organize, and protect your data.

Hey, my name is Hallie, and I’m an analytical lead at Google. I work with companies in the healthcare industry. I’m so excited to welcome you to this course. You’ve been building up your data analyst skills in lots of different ways so far. You’ve learned how to ask the right questions, define the problem, and present your analysis in a way that matches up with the needs of your stakeholders. In other words, you’ve learned how to tell a story using data. Now we’ll learn more about the data that you’ll need to tell the best story possible.

But before we do that, I’d love to tell you my story. I use analytics to help healthcare companies develop digital marketing solutions that make their business and their brands stronger. My team and I find business and media opportunities based on the latest industry and data insights. I’ve been working in healthcare for about five years, and it’s great. I really enjoy being able to use data to help spark change in such an important industry. As you’ll discover in this course, data can be the main character in a very powerful story. I absolutely love using analysis to tell that story in a way that’s compelling and informative.

Here’s a real life example of how I’ve used data to tell a story. In my job, we analyze Medicare enrollment data over time and make connections to how people research Medicare plans on Google. As people 65 and older become more informed decision makers for their health, I use the data to learn if there’s an increase in Medicare enrollments and what part Google searches play if there is an increase in demand. Now it’s very important that I make sure the data is relevant and valid. I also have to pay attention to questions around access and equity while maintaining the privacy of those conducting searches. The happy ending of my story is that the data in my findings is useful to medical professionals and their patients.

There’s so much useful data out there, and you’re building the skills you’ll need to find and use the right data in the best way. In this course, you’ll continue sharpening those skills. So you’ve already heard a lot about the data analysis process steps: Ask, Prepare, Process, Analyze, Share and Act. Now it’s time to learn how to prepare the data. You’ll learn to identify how data is generated and collected, and you’ll explore different formats, types and structures of data. We’ll make sure you know how to choose and use data that’ll help you understand and respond to a business problem. And because not all data fits each need, you’ll learn how to analyze data for bias and credibility. We’ll also explore what clean data means.

But wait, there’s more. You’ll also get up close and personal with databases. We’ll cover what they are and how analysts use them. You’ll even get to extract your own data from a database using a couple of tools that you’re already familiar with: spreadsheets and SQL. The key here is patience. Like anything worth doing, this will take time and practice. And I’ll be with you every step of the way.

Still with me? Great. The last few things we’ll cover are the basics of data organization and the process of protecting your data. Data works best when it’s organized. And if you’re organizing your data, you’ll want to protect it too. I’ll show you how to do both and apply it to your own analysis. I’m so excited to help you write your own personal story as you continue exploring the world of data analytics. So let’s do it.

Reading: Course syllabus

Overview

Video: Hallie: Fascinating data insights

Hallie is an analytical lead at Google who is passionate about using data to help the healthcare industry. She started her career analyzing large sums of patient data, and she has since expanded her skills to include user research and understanding how users search on Google.

Hallie believes that the most important skill in data analysis is creativity. She enjoys piecing together nuggets of information to create a narrative using data. This skillset is essential for identifying trends and patterns in data, and for communicating those findings to others in a clear and concise way.

In her role at Google, Hallie works with healthcare companies to help them understand the industry and to better market to their target audiences. She also conducts research on the healthcare industry to identify new trends and opportunities.

Hallie’s work has a direct impact on the healthcare industry. By helping healthcare companies to use data more effectively, she is helping them to improve their marketing campaigns, to develop new products and services, and to provide better care to their patients.

Here are some specific examples of how Hallie’s work has helped the healthcare industry:

  • She developed a new data-driven approach to marketing that has helped healthcare companies to increase their brand awareness and to reach more potential customers.
  • She created a report on the latest trends in healthcare search that has helped healthcare companies to better understand the needs of their patients and to develop more targeted marketing campaigns.
  • She worked with a team of engineers to develop a new tool that helps healthcare providers to identify patients who are at risk of developing certain diseases. This tool has helped healthcare providers to intervene early and to prevent patients from developing serious health problems.

Hallie is a valuable asset to the healthcare industry. She is a skilled data analyst and a creative thinker. She is also passionate about using data to make a positive impact on the world.

Healthcare is just a really fascinating place in the US. It’s a really incredible industry to work in because it is so historically traditional, and healthcare companies, unlike other tech companies, just really have not used data to inform decisions. When I was in college, I had a professor who didn’t want us to have textbooks because he just said the healthcare industry was changing so rapidly, and it wouldn’t make sense to have a textbook, which is just a static piece of text when things were just really evolving. So I would say healthcare and data and the two together is a newer concept using big data, using machine learning, and artificial intelligence to help the healthcare industries.

I started analyzing large sums of patient data. That was the first time I had really worked with such huge datasets, and I found it really fascinating that we can take all of these datasets and synthesize them and allow us to really deliver some cool insights and trends to our hospital systems. That was the first time I started thinking about data analysis, data analytics, as a possible career for me. That’s really what brought me to this analytical lead role at Google where I could take that knowledge and that skill set of analyzing datasets and do that on a daily basis, so that really, every conversation I was having with the client was a data-informed conversation.

I work within the healthcare vertical. We have companies who market on our platforms, like Google Search and YouTube. We help them understand the healthcare industry so that they can better market to the audience that they’re trying to reach. Whether you’re a healthcare insurer or you’re a health care provider, maybe a hospital system, they all have different needs on how they want to reach their audience using Google’s platforms. We help them optimize their marketing spend, but we also do a lot of research in the healthcare industry. Some user research, some understanding of how users are really just searching on Google to give them a sense of what’s really happening in the industry and how they can market effectively.

I would say that my technical skills with data analytics came with time. The most important skill I found, which has also come with time and grown with me, is just the creativity side of data analysis. I mean, you can really learn a lot of the SQL skills and R, and I know some of that is within the course. But really, the creativity side is something that just comes with experience. When you’re looking at a dataset, you might look at it one way and analyze it one way and then have someone else look at it or look at it a week later, and then all of a sudden the trend that you’re seeing is completely different. You have to take a lot of these pieces of information, these nuggets, I like to call them, and just piece together a really nice narrative using data. That skill set is something I learned when I was working in consulting, and I’ve taken that to Google and really been able to polish a lot of those skills and some of the more technical skills. Technical and the creative side are what I’ve grown to love.

My name is Hallie. I’m an analytical lead at Google working specifically in the healthcare vertical.

Reading: Deciding if you should take the speed track

Overview

Practice Quiz: Optional: Familiar with data analytics? Take our diagnostic quiz

A data analyst at a construction company is working on a report for a quickly approaching deadline. Why might they choose to analyze only historical data?

What are the benefits of data modeling? Select all that apply.

A group of high school students take a survey that asks, “Are you on an athletic team? Please reply yes or no.” What kind of data is being collected?

A data analyst is evaluating data to determine whether it is good or bad. Which qualities characterize good data? Select all that apply.

Imagine that a company uses your personal data as part of a financial transaction. Before it occurs, you are not made aware of the nature and scale of this transaction. What concept of data ethics does this violate?

Which of the following are protections afforded by data privacy? Select all that apply.

Which of the following are uses of relational databases? Select all that apply.

Which statements define primary keys and foreign keys and describe their relationship? Select all that apply.

What tasks can data analysts accomplish using metadata? Select all that apply.

A data analyst reviews a spreadsheet of boat auction sales to find the last five sailboats sold in Kentucky. What steps would they take in order to narrow the scope? Select all that apply.

You are writing a SQL query to filter data from a database that describes trees in Omaha, Nebraska. You want to only display entries for trees that have a diameter of 30 inches. The name of the table you’re using is Nebraska_trees and the name of the column that shows the diameters of the trees is trunk_diameter. What is the correct query syntax that will retrieve and filter data from this table?

Consistent naming conventions describe which properties of a file? Select all that apply.
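
For the SQL question above about trees in Omaha, the filter can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: only the table name Nebraska_trees and the column name trunk_diameter come from the question. The tree_id and species columns and the three sample rows are hypothetical, added just so the query has something to run against.

```python
import sqlite3

# In-memory database so the example leaves no files behind
conn = sqlite3.connect(":memory:")

# tree_id and species are hypothetical columns; trunk_diameter is from the question
conn.execute(
    "CREATE TABLE Nebraska_trees ("
    "tree_id INTEGER, species TEXT, trunk_diameter INTEGER)"
)
conn.executemany(
    "INSERT INTO Nebraska_trees VALUES (?, ?, ?)",
    [(1, "oak", 30), (2, "maple", 18), (3, "elm", 30)],  # made-up rows
)

# SELECT chooses the columns, FROM names the table, WHERE filters the rows
rows = conn.execute(
    "SELECT * FROM Nebraska_trees WHERE trunk_diameter = 30"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The general pattern is SELECT (which columns), FROM (which table), WHERE (which rows to keep). Only rows whose trunk_diameter equals 30 come back.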

Collecting data


Video: Data collection in our world

Data is being generated all around the world at an unprecedented rate, and there are many different ways that it can be generated and collected. Some common methods include:

  • Online activity: Every time we search the web, watch a video, or post on social media, we are generating data. This data can be collected by companies and organizations to learn more about our interests and habits.
  • Surveys and questionnaires: Surveys and questionnaires are a common way to collect data from people directly. They can be used to learn about people’s opinions, experiences, and behaviors.
  • Interviews: Interviews can be used to collect data from people in more depth. They can be used to get more personal and detailed information than surveys or questionnaires.
  • Scientific observations: Scientists collect data by observing the world around them. This can include observing animal behavior, studying bacteria under a microscope, or conducting experiments.
  • Forms: Forms are often used to collect data from people in person or online. They are commonly used by businesses and government agencies to collect information about customers, employees, and citizens.

Data can be used for a variety of purposes, including:

  • To improve products and services: Companies can use data to learn more about what their customers want and need. This information can then be used to improve products and services.
  • To make better decisions: Data can be used to make informed decisions about everything from business strategy to public policy.
  • To conduct research: Scientists and other researchers use data to learn more about the world around us and to solve problems.

It is important to note that data collection and generation should be done in an ethical and responsible manner. This means respecting people’s privacy and ensuring that data is used for its intended purpose.

Data analytics is the process of collecting, cleaning, analyzing, and interpreting data to gain insights and make better decisions. Data collection comes early in that process, as part of the Prepare phase that follows Ask, and it is important to collect the right data in order to get meaningful results.

There are many different ways to collect data, and the best method will depend on the specific needs of the data analysis project. Some common data collection methods include:

  • Online surveys: Online surveys are a convenient and efficient way to collect data from a large number of people. They can be used to collect data on a variety of topics, such as demographics, opinions, and behaviors.
  • Mobile surveys: Mobile surveys are similar to online surveys, but they are designed to be taken on mobile devices. This makes them a good option for collecting data from people who are on the go.
  • In-person interviews: In-person interviews allow researchers to collect more detailed and nuanced data than surveys. They can also be used to build relationships with participants and to get a better understanding of their perspectives.
  • Focus groups: Focus groups are small groups of people who are brought together to discuss a particular topic. They can be used to generate new ideas, to explore different perspectives, and to get feedback on products or services.
  • Observation: Observation is a data collection method that involves watching and recording people’s behavior. It can be used to collect data on a variety of topics, such as how people use products, how they interact with each other, and how they respond to different stimuli.
  • Sensor data: Sensors can be used to collect data on a variety of environmental factors, such as temperature, humidity, and air quality. Sensor data can be collected in real time and can be used to track changes over time.

Once the data has been collected, it needs to be cleaned and analyzed. Data cleaning involves removing any errors or inconsistencies from the data. Data analysis involves using statistical and visualization tools to identify patterns and trends in the data.

The results of the data analysis can then be used to make informed decisions about a variety of topics, such as how to improve products and services, how to allocate resources, and how to develop new policies.
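
As a rough illustration of the cleaning and analysis steps described above, here is a small Python sketch. The survey rows, the field names (id, age, prefers), and the cleaning rules (drop duplicate ids, drop rows without a valid age, normalize text) are all assumptions made for the example. Real projects choose their own rules based on the data.

```python
from collections import Counter

# Hypothetical raw survey responses, with typical problems mixed in
raw_responses = [
    {"id": "1", "age": "34", "prefers": " Telemedicine "},  # stray spaces
    {"id": "2", "age": "", "prefers": "in-person"},         # missing age
    {"id": "2", "age": "", "prefers": "in-person"},         # duplicate row
    {"id": "3", "age": "47", "prefers": "IN-PERSON"},       # inconsistent case
    {"id": "4", "age": " 51", "prefers": "telemedicine"},
    {"id": "5", "age": "abc", "prefers": "telemedicine"},   # invalid age
]


def clean(responses):
    """Return rows with unique ids, a valid integer age, and normalized text."""
    seen_ids = set()
    cleaned = []
    for row in responses:
        if row["id"] in seen_ids:
            continue  # skip duplicates of an id we already handled
        seen_ids.add(row["id"])

        age_text = row["age"].strip()
        if not age_text.isdigit():
            continue  # skip rows whose age is missing or not a number

        cleaned.append({
            "id": int(row["id"]),
            "age": int(age_text),
            "prefers": row["prefers"].strip().lower(),
        })
    return cleaned


cleaned = clean(raw_responses)

# A very simple analysis step: count each answer after cleaning
preference_counts = Counter(row["prefers"] for row in cleaned)

print(cleaned)
print(preference_counts)
```

Without the cleaning step, "IN-PERSON" and "in-person" would be counted as two different answers, which is exactly the kind of inconsistency cleaning is meant to remove.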

Here are some tips for collecting data effectively:

  • Define your goals: What do you hope to learn from the data you collect? Once you know your goals, you can choose the right data collection method and collect the right data.
  • Identify your target audience: Who are you trying to collect data from? Once you know your target audience, you can choose a data collection method that is likely to reach them.
  • Choose the right data collection method: There are many different data collection methods available. Choose a method that is appropriate for your target audience and that will allow you to collect the data you need.
  • Pilot test your data collection instrument: Before you launch your data collection effort, pilot test your data collection instrument (e.g., survey questionnaire, interview guide, observation checklist) with a small group of people to ensure that it is clear, easy to understand, and that it will collect the data you need.
  • Collect data ethically: Be sure to collect data in an ethical and responsible manner. This means respecting people’s privacy and ensuring that they understand how their data will be used.

Example of data collection in data analytics:

A company that sells e-commerce software wants to learn more about how their customers use their product. They develop an online survey and ask their customers to complete it. The survey asks customers about their demographics, how they use the software, and what they like and dislike about it.

The company then analyzes the survey results. They find that their customers are mostly small businesses and that they use the software to manage their inventory and orders. The customers also appreciate the software’s ease of use and its customer support.

The company uses the results of the survey to improve their product and to develop new features that their customers will find valuable.

Data collection is an essential part of the data analytics process. By collecting the right data and using it effectively, organizations can gain valuable insights and make better decisions.

To track people's online activities and interests, which method of data collection is most effective?

Cookies

To track people’s online activities and interests, cookies are most effective.

Right now data is being generated all around the world and we’re talking tons of data. Every minute of every day millions of texts and hundreds of millions of emails are sent. On top of that, millions of online searches are made and videos viewed and those numbers are only growing. That’s a lot of data. Let’s learn more about how it’s made and used. In this video, we’ll talk about the ways that data can be generated and how industries collect data themselves.

Every piece of information is data. All that data is usually generated as a result of our activity in the world. These days, we spend a lot of time online. With social media and mobile devices, millions and millions of people are adding to the huge amount of data out there, each and every day. Think about it like this. Every digital photo online is one piece of data. Every photo itself holds even more data, from the number of pixels to the colors contained in each of those pixels.

But that’s not the only way data is made. We can also generate data by collecting information. This data generation and collection comes with a few more things to think about. It needs to be done with consideration to ethics so that we maintain people’s rights and privacy. We’ll learn more about that later on.

For now, let’s check out a real world example. The United States Census Bureau uses forms to collect data about the country’s population. This data is used for a number of reasons, like funding for schools, hospitals, and fire departments. The Bureau also collects information about things like U.S. businesses, creating their own data in the process. The great thing about this is that others can then use the data for their own needs, including analysis. The annual business survey is used to figure out the needs of businesses and how to provide them with resources to help them succeed.

I actually generate data in the analytics I do for the health care industry. We run a lot of surveys to learn how patients feel about certain things related to their health care. For example, one survey asked how patients feel about telemedicine versus in-person doctor visits. The data we collected help the companies we work with improve the care that their patients receive.

Survey data is just one example. There’s all kinds of data being generated all the time, and there’s lots of different ways to collect it. Even something as simple as an interview can help someone collect data. Imagine you’re in a job interview. To impress the hiring manager, you want to share information about yourself. The hiring manager collects that data and analyzes it to help them decide whether to hire you or not. But it goes both ways. You could also collect your own data about the company to help you decide if the company is a good fit for you. Or you can use the data you collect to come up with thoughtful questions to ask the interviewer.

Scientists also generate data. They use a lot of observations in their work. For example, they might collect data by studying animal behavior or looking at bacteria under a microscope. Earlier we talked about the forms that the U.S. Census Bureau uses to collect data. Forms, questionnaires and surveys are commonly used ways to collect and generate data.

One thing to note: data that’s generated online doesn’t always happen directly. Have you ever wondered why some online ads seem to make really accurate suggestions or how some websites remember your preferences? This is done using cookies, which are small files stored on computers that contain information about users. Cookies can help inform advertisers about your personal interests and habits based on your online surfing, without personally identifying you.

As a real world analyst, you’ll have all kinds of data right at your fingertips and lots of it too. Knowing how it’s been generated can help add context to the data, and knowing how to collect it can make the data analysis process more efficient. Coming up, you’ll learn how to decide what data to collect for your analysis. So stay tuned.

Video: Determining what data to collect

When collecting data for a data analytics project, there are a number of factors to consider, including:

  • What data is needed? The first step is to identify the specific data that is needed to address the problem or question being investigated.
  • How will the data be collected? There are a variety of data collection methods available, such as surveys, interviews, observations, and sensor data. The best method will depend on the specific needs of the project.
  • Where will the data come from? Data can come from several sources: first-party data (collected by an individual or organization using its own resources), second-party data (collected by another organization directly from its own audience and then sold), and third-party data (obtained from outside sources that did not collect it directly). The source of the data affects its reliability and accuracy.
  • How much data is needed? It is important to collect enough data for the results to be statistically reliable, but too much data can be difficult and expensive to manage and analyze.
  • What data type is needed? There are a variety of data types, such as numerical, categorical, and text data. The type of data needed will depend on the specific analysis being performed.
  • What time frame is needed for data collection? The time frame for data collection will depend on the specific needs of the project. If an immediate answer is needed, historical data can be used. However, if tracking patterns over time is needed, a longer time frame may be necessary.

It is important to note that data collection should be done in an ethical and responsible manner. This means respecting people’s privacy and ensuring that data is used for its intended purpose.

Tutorial: Determining what data to collect in data analytics

Data collection is an early step in the data analytics process, coming right after you define the question you want to answer. It is important to collect the right data in order to get meaningful results. However, with a nearly endless amount of data available, it can be difficult to know what data to collect.

Here are some tips for determining what data to collect in data analytics:

  1. Start by identifying the business question or problem that you are trying to solve. What do you hope to learn from the data? Once you know your goal, you can start to think about what data you need to collect in order to achieve it.
  2. Consider your audience. Who are you trying to learn about? Once you know your audience, you can start to think about what data is relevant to them.
  3. Think about the different types of data that are available. There are many different types of data, such as demographic data, behavioral data, and attitudinal data. The type of data you need to collect will depend on your business question or problem.
  4. Consider the quality of the data. Not all data is created equal. Some data is more reliable and accurate than others. When choosing data, it is important to consider its quality.
  5. Think about the cost and time required to collect the data. Collecting data can be expensive and time-consuming. It is important to weigh the costs and benefits of collecting different types of data before making a decision.

Here are some examples of how to apply these tips:

  • Example 1: A company that sells e-commerce software wants to learn more about how their customers use their product. They identify the following business question: “What features of our product are our customers using most often?” To answer this question, they need to collect data on how customers use the product. This data could include information such as which features customers click on most often, how long they spend using each feature, and whether they complete tasks using the features.
  • Example 2: A political campaign wants to learn more about the opinions of voters in their district. They identify the following business question: “What are the most important issues to voters in our district?” To answer this question, they need to collect data on voters’ opinions on a variety of issues. This data could be collected through surveys, interviews, or focus groups.
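As a rough illustration of Example 1, the sketch below counts how often each feature appears in a click log. The event list, the field names, and the `feature_usage` helper are all hypothetical; a real product would export its own usage log in its own shape.

```python
from collections import Counter

# Hypothetical click log: one entry per feature interaction (assumed shape).
events = [
    {"customer": "c1", "feature": "checkout"},
    {"customer": "c2", "feature": "search"},
    {"customer": "c1", "feature": "checkout"},
    {"customer": "c3", "feature": "wishlist"},
    {"customer": "c2", "feature": "checkout"},
    {"customer": "c3", "feature": "search"},
]


def feature_usage(event_log):
    """Count how many times each feature appears in the event log."""
    return Counter(event["feature"] for event in event_log)


usage = feature_usage(events)

# List features from most used to least used.
for feature, clicks in usage.most_common():
    print(feature, clicks)
```

Counting clicks answers only the "most often" part of the question; time spent and task completion would need additional fields in the log.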

Once you have collected the data, you can start to analyze it to identify patterns and trends. This information can then be used to make informed decisions about your business or organization.

Here are some additional tips for determining what data to collect:

  • Start small. It is better to start with a small amount of data that is relevant to your business question or problem than to collect a large amount of data that is not relevant.
  • Be specific. When choosing data, be as specific as possible. For example, instead of collecting data on “customer satisfaction,” collect data on “customer satisfaction with the checkout process.”
  • Be flexible. As you learn more about your business question or problem, you may need to adjust the data that you are collecting. Be prepared to adapt your data collection strategy as needed.

By following these tips, you can determine what data to collect in data analytics and get the most out of your data analysis.

In instances when collecting data from an entire population is challenging, data analysts may choose to use what?

A sample

In instances when collecting data from an entire population is challenging, data analysts may choose to use a sample. A sample is a part of a population that is representative of that population.
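A minimal sketch of drawing a simple random sample, assuming the population fits in a list. The car IDs and the `draw_sample` helper are made up for illustration, and random selection is only one of several sampling approaches; it aims for a representative sample but does not guarantee one.

```python
import random


def draw_sample(population, size, seed=None):
    """Return `size` distinct items chosen at random from the population."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, size)


# Hypothetical population: IDs for 1,000 cars observed in a city.
population = list(range(1, 1001))

sample = draw_sample(population, size=50, seed=42)
print(len(sample))
```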

The data-collection process involves deciding what data to use, determining how much data to collect, and selecting the right data type. Which of the following are also steps in the data-collection process? Select all that apply.
  • Determining the time frame
  • Choosing data sources

Determining the time frame and choosing data sources are steps in the data collection process.

Welcome back. We’ve talked a lot about all the data out
there in the world. But as a data analyst, you’ll need to decide
what kind of data to collect and use
for every project. With a nearly endless
amount of data out there, this can be quite a bit of a data dilemma, but
there’s good news. In this video, you’ll learn which factors to consider
when collecting data. Usually, you’ll have
a head start in figuring out the right
data for the job, because the data you need
will be given to you, or your business task or problem will narrow down your choices. Let’s start with a question like, what’s causing increased
rush hour traffic in your city? First, you need to know how
the data will be collected. You might use observations
of traffic patterns to count the number of cars
on city streets during particular times. You notice that cars are getting backed up on a specific street. That brings us to data sources. In our traffic example, your observations would
be first-party data. This is data collected by an individual or group
using their own resources. Collecting first-party
data is typically the preferred method because you know exactly where it came from. You might also have
second-party data, which is data
collected by a group directly from its
audience and then sold. In our example, if you aren’t able to collect your own data, you might buy it from
an organization that’s led traffic pattern
studies in your city. This data didn’t start with you, but it’s still reliable
because it came from a source that has experience
with traffic analysis. The same can’t always be said
about third-party data or data collected from outside sources who did not
collect it directly. This data might have
come from a number of different sources before
you investigated it. It might not be as reliable, but that doesn’t mean
it can’t be useful. You’ll just want to make
sure you check it for accuracy, bias, and credibility. Actually, no matter what
kind of data you use, it needs to be inspected for accuracy and trustworthiness. We’ll learn more about
that process later. For now, just remember that
the data you choose should apply to your needs, and it
must be approved for use. As a data analyst, it’s your job to decide
what data to use, and that means choosing the data that can help you
find answers and solve problems and not getting
distracted by other data. In our traffic example, financial data probably
wouldn’t be that helpful, but existing data about high volume traffic
times would be. Okay. Now let’s talk about
how much data to collect. In data analytics, a
population refers to all possible data values
in a certain data set. If you’re analyzing data
about car traffic in a city, your population would be
all the cars in that area. But collecting data from the entire population can
be pretty challenging. That’s why a sample
can be useful. A sample is a part
of a population that is representative
of the population. You might collect a data
sample about one spot in the city and analyze
the traffic there, or you might pull a
random sample from all existing data
in the population. How you choose your sample
will depend on your project. As you collect data,
you’ll also want to make sure you select
the right data type. For traffic data, an
appropriate data type could be the dates of traffic records
stored in a date format. The dates could help you figure what days of the week there is likely to be a high volume
of traffic in the future. We’ll explore this topic
in more detail soon. Finally, you need to determine the time frame for
data collection. In our example, if you needed
an answer immediately, you’d have to use
historical data, which is data that
already exists. But let’s say you needed to track traffic patterns over
a long period of time. That might affect
the other decisions you make during data collection. Now you know more about the different data
collection considerations you’ll use as a data analyst. Because of that,
you’ll be able to find the right data when you start
collecting it yourself. There’s still more to learn about data collection, so stay tuned.

Reading: Selecting the right data

Overview

Practice Quiz: Test your knowledge on collecting data

Which method of data-collection is most commonly used by scientists?

Organizations such as the U.S. Centers for Disease Control (CDC) often use data collected from hospitals. What kind of data is the CDC using if it is collected by hospitals, then sold to the CDC for its own analysis?

Fill in the blank: In data analytics, a _____ refers to all possible data values in a certain dataset.

Differentiate between data formats and structures


Video: Discover data formats

This video covers data formats and how they are used in data analysis. There are two main types of data: quantitative and qualitative. Quantitative data can be measured or counted and expressed as a number, while qualitative data cannot be counted, measured, or easily expressed using numbers. Quantitative data can be further divided into discrete and continuous data. Discrete data is counted and has a limited number of values, while continuous data is measured (for example, a run time measured with a timer) and can be shown as a decimal with several places.

Qualitative data can be further divided into nominal and ordinal data. Nominal data is categorized without a set order, while ordinal data has a set order or scale.
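The distinctions above can be sketched in Python. The rating scale, the variable names, and the `sort_ordinal` helper are assumptions chosen for illustration; the point is that ordinal labels can be sorted by their position on a scale, while nominal labels have no such position.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
RATING_SCALE = ["poor", "fair", "good", "excellent"]


def sort_ordinal(values, scale):
    """Sort ordinal labels by their position on the given scale."""
    rank = {label: position for position, label in enumerate(scale)}
    return sorted(values, key=lambda label: rank[label])


responses = ["good", "poor", "excellent", "fair", "good"]
ordered = sort_ordinal(responses, RATING_SCALE)
print(ordered)

# Quantitative examples (values are made up).
tickets_sold = 3             # discrete: counted in whole units
run_time_minutes = 110.0356  # continuous: measured, can carry decimals

# Nominal example: categories with no inherent order.
watched_answers = {"Yes", "No", "Not sure"}
```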

There are two main types of data sources: internal and external. Internal data is data that lives within a company’s own systems, while external data is data that lives and is generated outside of an organization.

Data can also be structured or unstructured. Structured data is data that is organized in a certain format, such as rows and columns, while unstructured data is not organized in any easily identifiable manner.

Data formats are important because they make data more easily searchable and analysis-ready. Data analysts typically work with structured data, which is usually in the form of a table, spreadsheet, or relational database. However, they may also come across unstructured data, such as audio and video files.

Tutorial on “Discover data formats” in Data analysis

What is a data format?

A data format is the way that data is organized and stored. There are many different data formats, each with its own advantages and disadvantages. Some common data formats include:

  • CSV (comma-separated values): CSV is a simple and easy-to-read format that is often used to store data in spreadsheets.
  • JSON (JavaScript Object Notation): JSON is a lightweight format that is often used to store data in web applications.
  • XML (Extensible Markup Language): XML is a structured format that is often used to store data in complex documents.
  • Parquet: Parquet is a columnar storage format that is often used to store large datasets in data warehouses.
  • ORC (Optimized Row Columnar): ORC is a columnar storage format that is similar to Parquet.
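To make the first two formats concrete, here is a small sketch that writes the same records as CSV and as JSON using only the Python standard library. The movie records and the `to_csv` helper are invented for illustration. Parquet and ORC are not shown because they need third-party libraries.

```python
import csv
import io
import json

# Hypothetical records (made-up titles and ratings).
records = [
    {"title": "Movie A", "genre": "comedy", "rating": 4},
    {"title": "Movie B", "genre": "thriller", "rating": 5},
]


def to_csv(rows):
    """Serialize a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)

# Reading the CSV back yields strings only; JSON keeps the number type.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)
```

One practical difference the sketch exposes: CSV stores every value as text, so the rating comes back as the string "4", while JSON preserves it as a number.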

Why is it important to know the format of a data file?

Knowing the format of a data file is important because it allows you to use the appropriate tools to read and analyze the data. For example, if you have a CSV file, you can use a spreadsheet program like Excel to read and analyze the data. However, if you open the same CSV file in a word processing program like Microsoft Word, you will typically see only the raw comma-separated text rather than a table you can sort, filter, and calculate on.

How to discover the format of a data file

There are a few different ways to discover the format of a data file:

  • Look at the file extension: The file extension is the set of characters after the last period in a file name. It is often three letters but can be longer, as in .json or .parquet. For example, a CSV file will usually have the file extension .csv.
  • Open the file in a text editor: If you open the file in a text editor, you should be able to see the format of the data. For example, if the data is in CSV format, you will see that the data is separated by commas.
  • Use a data analysis tool: Many data analysis tools can automatically detect the format of a data file. For example, if you open a CSV file in Excel, Excel will automatically detect that the file is in CSV format.
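The first two checks above can be approximated in code. This is a heuristic sketch, not a reliable detector: the extension table and both helper functions are assumptions, an extension can be wrong or missing, and the content check only looks at text formats (Parquet and ORC are binary and are recognized here by extension only).

```python
import csv
import json
from pathlib import Path

# Assumed mapping from common extensions to format names.
KNOWN_EXTENSIONS = {
    ".csv": "CSV",
    ".json": "JSON",
    ".xml": "XML",
    ".parquet": "Parquet",
    ".orc": "ORC",
}


def guess_by_extension(filename):
    """Guess the format from the file extension alone."""
    return KNOWN_EXTENSIONS.get(Path(filename).suffix.lower(), "unknown")


def guess_by_content(text):
    """Guess the format of text data by trying a few simple checks in order."""
    stripped = text.strip()
    if not stripped:
        return "unknown"
    try:
        json.loads(stripped)  # valid JSON parses without error
        return "JSON"
    except json.JSONDecodeError:
        pass
    if stripped.startswith("<"):  # crude check for markup such as XML
        return "XML"
    try:
        csv.Sniffer().sniff(stripped, delimiters=",;\t")
        return "CSV"
    except csv.Error:
        return "unknown"


print(guess_by_extension("survey_results.csv"))
print(guess_by_content('{"name": "Ada", "score": 90}'))
print(guess_by_content("name,score\nAda,90\nLin,85\n"))
```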

Examples of data formats in data analytics

Here are a few examples of how data formats are used in data analytics:

  • A data analyst might use a CSV file to store data from a customer survey.
  • A data analyst might use a JSON file to store data from a web application.
  • A data analyst might use an XML file to store data from a financial report.
  • A data analyst might use a Parquet file to store data from a data warehouse.
  • A data analyst might use an ORC file to store data from a data lake.

How to discover data formats in a data science project

When working on a data science project, there are a few things you can do to discover the format of the data you are working with:

  1. Ask the person who provided the data. If you have access to the person who provided the data, you can simply ask them what format the data is in. This is often the easiest and fastest way to find out the format of the data.
  2. Look at the documentation. If the data is from a public source, there may be documentation available that describes the format of the data. This documentation may be available on the website where you downloaded the data or in a separate file.
  3. Inspect the data. If you don’t have access to the person who provided the data and there is no documentation available, you can inspect the data yourself to try to determine the format. You can do this by opening the data in a text editor or by using a data analysis tool.

Once you have discovered the format of the data, you can use the appropriate tools to read and analyze the data.

Conclusion

Data formats are an important part of data analytics. Knowing the format of a data file allows you to use the appropriate tools to read and analyze the data. There are many different data formats, each with its own advantages and disadvantages. Some common data formats include CSV, JSON, XML, Parquet, and ORC.

When working on a data science project, there are a few things you can do to discover the format of the data you are working with:

  • Ask the person who provided the data.
  • Look at the documentation.
  • Inspect the data.

Once you have discovered the format of the data, you can use the appropriate tools to read and analyze the data.

An entertainment website displays a star rating for a movie based on user reviews. Users can select from one to five whole stars to rate the movie. The star rating is an example of what type of data? Select all that apply.

Ordinal, Discrete

The star rating is an example of ordinal data because the ratings are in order of how much each person liked the movie. It’s also an example of discrete data because a person has to choose a whole number of stars; half stars aren’t an option.

The use of external data is particularly valuable in which circumstances?

When analysis depends on as many data sources as possible

External data is particularly valuable when an analysis depends on as many sources as possible.

I don’t know about you, but when I’m choosing a movie to watch, I sometimes get stuck between a couple of choices. If I’m in the mood for excitement or suspense, I might go for a thriller, but if I need a good laugh, I’ll choose a comedy. If I really can’t decide between two movies, I might even use some of my data analysis skills to compare and contrast them. Come to think of it, there really needs to be more movies about data analysts. I’d watch that, but since we can’t watch a movie about data, at least not yet, we’ll do the next best thing: watch data about movies!

We’re going to take a look at this spreadsheet with movie data. We know we can compare different movies and movie genres. Turns out, you can do the same with data and data formats. Let’s use our movie data spreadsheet to understand how that works.

We’ll start with quantitative and qualitative data. If we check out column A, we’ll find titles of the movies. This is qualitative data because it can’t be counted, measured, or easily expressed using numbers. Qualitative data is usually listed as a name, category, or description. In our spreadsheet, the movie titles and cast members are qualitative data.

Next up is quantitative data, which can be measured or counted and then expressed as a number. This is data with a certain quantity, amount, or range. In our spreadsheet here, the last two columns show the movies’ budget and box office revenue. The data in these columns is listed in dollars, which can be counted, so we know that data is quantitative.

We can go even deeper into quantitative data and break it down into discrete or continuous data. Let’s check out discrete data first. This is data that’s counted and has a limited number of values. Going back to our spreadsheet, we’ll find each movie’s budget and box office returns in columns M and N. These are both examples of discrete data that can be counted and have a limited number of values. For example, the amount of money a movie makes can only be represented with exactly two digits after the decimal to represent cents. There can’t be anything between one and two cents.

Continuous data can be measured using a timer, and its value can be shown as a decimal with several places. Let’s imagine a movie about data analysts that I’m definitely going to star in someday. You could express that movie’s run time as 110.0356 minutes. You could even add fractional data after the decimal point if you needed to.

There’s also nominal and ordinal data. Nominal data is a type of qualitative data that’s categorized without a set order. In other words, this data doesn’t have a sequence. Here’s a quick example. Let’s say you’re collecting data about movies. You ask people if they’ve watched a given movie. Their responses would be in the form of nominal data. They could respond “Yes,” “No,” or “Not sure.” These choices don’t have a particular order.

Ordinal data, on the other hand, is a type of qualitative data with a set order or scale. If you asked a group of people to rank a movie from 1 to 5, some might rank it as a 2, others a 4, and so on. These rankings are in order of how much each person liked the movie.

Now let’s talk about internal data, which is data that lives within a company’s own systems. For example, if a movie studio had compiled all of the data in the spreadsheet using only their own collection methods, then it would be their internal data. The great thing about internal data is that it’s usually more reliable and easier to collect, but in this spreadsheet, it’s more likely that the movie studio had to use data owned or shared by other studios and sources because it includes movies they didn’t make. That means they’d be collecting external data. External data is, you guessed it, data that lives and is generated outside of an organization. External data becomes particularly valuable when your analysis depends on as many sources as possible.

A great thing about this data is that it’s structured. Structured data is data that’s organized in a certain format, such as rows and columns. Spreadsheets and relational databases are two examples of software that can store data in a structured way. You might remember our earlier exploration of structured thinking, which helps you add a framework to a problem so that you can solve it in an organized and logical manner. You can think of structured data in the same way. Having a framework for the data makes the data easily searchable and more analysis-ready.

As a data analyst, you’ll work with a lot of structured data, which will usually be in the form of a table, spreadsheet or relational database, but sometimes you’ll come across unstructured data. This is data that is not organized in any easily identifiable manner. Audio and video files are examples of unstructured data because there’s no clear way to identify or organize their content. Unstructured data might have internal structure, but the data doesn’t fit neatly in rows and columns like structured data.

And there you have it! Hopefully you’re now more familiar with data formats and how you might use them in your work. In just a bit, you’ll continue to explore structured data and learn even more about the data you’ll use most often as an analyst. Coming soon to a screen near you.

Reading: Data formats in practice

Overview

Practice Quiz: Self-Reflection: Unstructured data

Overview

Video: Understanding structured data

Structured data is data that is organized in a format like rows and columns. It works nicely within a data model, which is a model that is used for organizing data elements and how they relate to one another. Data elements are pieces of information, such as people’s names, account numbers, and addresses. Data models help to keep data consistent and provide a map of how data is organized. This makes it easier for analysts and other stakeholders to make sense of their data and use it for business purposes. Structured data is also useful for databases, making it easy for analysts to enter, query, and analyze the data whenever they need to. Structured data can also be applied directly to charts, graphs, heat maps, dashboards, and most other visual representations of data. Spreadsheets and databases that store data sets are widely used sources of structured data.

Understanding structured data in data analysis

Structured data is data that is organized in a specific format, such as rows and columns. This makes it easy for computers to read and process structured data. Structured data is often stored in databases and spreadsheets.

Examples of structured data

Here are some examples of structured data:

  • Customer data in a CRM system
  • Product data in an e-commerce database
  • Financial data in an accounting system
  • Sensor data from IoT devices
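  • Social media metrics stored in tables, such as counts of likes and shares (the posts, photos, and videos themselves are unstructured)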

Why is structured data important for data analysis?

Structured data is important for data analysis because it is easy to search, sort, and filter. This makes it possible to quickly identify patterns and trends in the data. Structured data can also be used to create accurate and reliable reports.
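A short sketch of why rows and columns make searching, sorting, and filtering straightforward. The movie rows, field names, and budget figures below are invented; each dictionary stands in for one row of a table.

```python
# Hypothetical table: each dictionary is a row, each key a column.
movies = [
    {"title": "Movie A", "genre": "comedy", "budget": 20_000_000},
    {"title": "Movie B", "genre": "thriller", "budget": 45_000_000},
    {"title": "Movie C", "genre": "comedy", "budget": 8_000_000},
]

# Filter: keep only the rows whose genre is "comedy".
comedies = [row for row in movies if row["genre"] == "comedy"]

# Sort: order rows from the largest budget to the smallest.
by_budget = sorted(movies, key=lambda row: row["budget"], reverse=True)

# Search: find the first row with a given title (None if absent).
match = next((row for row in movies if row["title"] == "Movie C"), None)

print([row["title"] for row in comedies])
print([row["title"] for row in by_budget])
print(match)
```

Each operation is a single expression because every row exposes the same named fields, which is the property that unstructured data lacks.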

How to use structured data for data analysis

To use structured data for data analysis, you will need to:

  1. Identify the data sources: The first step is to identify the data sources that contain the data you want to analyze. This could be a database, spreadsheet, or other type of file.
  2. Clean and prepare the data: Once you have identified the data sources, you will need to clean and prepare the data. This may involve removing duplicate rows, correcting errors, and converting the data to a consistent format.
  3. Analyze the data: Once the data is clean and prepared, you can start to analyze it. This may involve using statistical software to calculate summary statistics, create visualizations, and build models.
  4. Interpret the results: Once you have analyzed the data, you need to interpret the results. This involves identifying any patterns or trends in the data and drawing conclusions.
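Steps 2 and 3 above might look like the following in Python. The rows, the column names, and the `clean` helper are hypothetical, and dropping rows that cannot be parsed is only one possible cleaning policy; in practice that choice depends on the project.

```python
import statistics

# Hypothetical raw rows, as they might arrive from a spreadsheet export.
raw_rows = [
    {"week": "2019-07-28", "interest": "14"},
    {"week": "2019-07-28", "interest": "14"},    # exact duplicate
    {"week": "2019-08-04", "interest": " 18 "},  # stray whitespace
    {"week": "2019-08-11", "interest": "n/a"},   # not a number
]


def clean(rows):
    """Remove duplicate rows and convert interest values to integers."""
    seen = set()
    cleaned_rows = []
    for row in rows:
        key = (row["week"], row["interest"].strip())
        if key in seen:
            continue  # skip rows that repeat an earlier one
        seen.add(key)
        try:
            interest = int(key[1])
        except ValueError:
            continue  # assumed policy: drop values that are not numbers
        cleaned_rows.append({"week": row["week"], "interest": interest})
    return cleaned_rows


cleaned = clean(raw_rows)
average = statistics.mean(row["interest"] for row in cleaned)

print(cleaned)
print(average)
```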

Tips for using structured data for data analysis

Here are some tips for using structured data for data analysis:

  • Use a data analysis tool: A data analysis tool will make it easier to clean, prepare, and analyze structured data. There are many different data analysis tools available, both free and commercial.
  • Start with a clear question: Before you start analyzing the data, it is important to have a clear question in mind. This will help you to focus your analysis and get the most out of the data.
  • Use visualization: Visualization can be a powerful tool for understanding structured data. Charts and graphs can help you to identify patterns and trends that might be difficult to see in the raw data.
  • Be critical of the results: It is important to be critical of the results of your analysis. Consider the quality of the data, the limitations of the analysis methods, and the potential for bias.

Conclusion

Structured data is an important part of data analysis. It is easy to read and process, and it can be used to create accurate and reliable reports. By following the tips above, you can use structured data to gain valuable insights into your business or organization.

Hi, great to see you again! Earlier, we compared some data formats, including structured and unstructured data. Most of the data being generated right now is actually unstructured. Audio files, video files, emails, photos, and social media are all examples of unstructured data. These can be harder to analyze in their unstructured format. But here’s the good news, you’ll be working with structured data most of the time. For example, if you need to analyze data about the unstructured data in emails, photos, and social media sites, it’ll most likely be structured for analysis before you even get to it.

Because of that, I want to explore structured data a bit more. As a quick refresher, structured data is data organized in a format like rows and columns. But there’s definitely more to it than that. Structured data works nicely within a data model, which is a model that is used for organizing data elements and how they relate to one another. What are data elements? They’re pieces of information, such as people’s names, account numbers, and addresses. Data models help to keep data consistent and provide a map of how data is organized. This makes it easier for analysts and other stakeholders to make sense of their data and use it for business purposes.

In addition to working well within data models, structured data is also useful for databases. This makes it easy for analysts to enter, query, and analyze the data whenever they need to. This also helps make data visualization pretty easy because structured data can be applied directly to charts, graphs, heat maps, dashboards and most other visual representations of data.

Alright, so now we know that spreadsheets and databases that store data sets are widely used sources of structured data. After you explore some other data structures, you’ll check out more data types using a spreadsheet. The adventure continues!

Reading: The structure of data

Overview

Reading: Data modeling levels and techniques

Overview

Practice Quiz: Test your knowledge on data formats and structures

Fill in the blank: The running time of a movie is an example of _____ data.

What are the characteristics of unstructured data? Select all that apply.

Structured data enables data to be grouped together to form relations. This makes it easier for analysts to do what with the data? Select all that apply.

Which of the following is an example of unstructured data?

Explore data types, fields, and values


Video: Know the type of data you’re working with

Data types in spreadsheets can be one of three things: a number, a text or string, or a Boolean.

Number data types are used to represent numerical values, such as search interests, currency, and dates.

Text or string data types are used to represent sequences of characters and punctuation, such as names, addresses, and phone numbers.

Boolean data types are used to represent two possible values: true or false.

It is important to know the data type of each cell in a spreadsheet, as this will affect how the data can be used and manipulated. For example, if you try to perform a mathematical operation on a text cell, you will get an error.

One common issue that people encounter in spreadsheets is mistaking data types with cell values. For example, if you try to average a column of text cells, you will get an error. This is because the text cells are not numbers, and therefore cannot be averaged.

To avoid errors, it is important to make sure that each cell holds the data type you expect. You can check this through the cell’s format settings, or restrict what can be entered by using the spreadsheet’s data validation feature.
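The same three data types exist in Python, and the averaging problem can be reproduced there. This is an analogy rather than a model of how any particular spreadsheet behaves, and the cell values are made up. The threshold of 50 follows the search-interest example from the video.

```python
import statistics

# Hypothetical cell values for three weeks of search interest.
number_cells = [14, 18, 13]        # number data type
text_cells = ["14", "18", "13"]    # text data type that only looks numeric

# Boolean data type: is each week's interest at least 50 out of 100?
is_popular = [value >= 50 for value in number_cells]

# Averaging numbers works.
number_average = statistics.mean(number_cells)

# Averaging text does not: Python raises a TypeError.
try:
    statistics.mean(text_cells)
    text_average_failed = False
except TypeError:
    text_average_failed = True

# Converting the text to numbers first makes the calculation possible.
converted_average = statistics.mean(int(cell) for cell in text_cells)

print(number_average, is_popular, text_average_failed, converted_average)
```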

Know the type of data you’re working with in Data analytics

The type of data you’re working with is one of the most important factors to consider when conducting data analytics. The type of data will dictate the types of analyses you can perform and the types of conclusions you can draw.

There are two main types of data: qualitative and quantitative.

  • Qualitative data is data that is descriptive in nature, such as names, addresses, and opinions. Qualitative data cannot be measured or counted.
  • Quantitative data is data that is numerical in nature, such as height, weight, and temperature. Quantitative data can be measured and counted.

Within these two main categories, there are many different types of data. Here are some examples:

  • Demographic data: This type of data includes information about people, such as age, gender, education, and income. Demographic data is often used to understand the characteristics of a population or to segment a population into different groups.
  • Behavioral data: This type of data tracks how people interact with products, services, and content. Behavioral data can be used to understand customer preferences, identify trends, and improve product development.
  • Financial data: This type of data includes information about a company’s financial performance, such as revenue, expenses, and profits. Financial data is often used to assess the financial health of a company and to make investment decisions.
  • Operational data: This type of data tracks the performance of business processes, such as order fulfillment and customer service. Operational data can be used to identify bottlenecks, improve efficiency, and reduce costs.

Once you know the type of data you’re working with, you can start to think about the types of analyses you can perform. For example, if you have quantitative data, you can perform statistical analysis to identify trends and patterns. If you have qualitative data, you can perform content analysis to identify common themes and ideas.

Here are some tips for knowing the type of data you’re working with in data analytics:

  • Identify the source of the data. Where did the data come from? What is the purpose of the data?
  • Look at the data. What kind of information is included in the data? Is the data in a numerical or textual format?
  • Check the data type of each column in the data set. This can be done using a spreadsheet program or a programming language such as Python.
  • Look for outliers and missing values. Outliers are data points that are significantly different from the rest of the data. Missing values are data points that are missing from the data set.
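Checking column types and missing values in Python could be sketched as follows, using only the standard library. The sample data, column names, and both helper functions are assumptions. The type rules are deliberately simple, so dates, for example, are reported as text.

```python
import csv
import io

# Hypothetical CSV export with one missing value in the cupcakes column.
SAMPLE = """week,cupcakes,top_treat,above_50
2019-07-28,14,ice cream,FALSE
2019-08-04,,ice cream,FALSE
2019-08-11,55,cupcakes,TRUE
"""


def infer_type(values):
    """Classify a column as boolean, number, text, or empty."""
    present = [value for value in values if value != ""]
    if not present:
        return "empty"
    if all(value.upper() in ("TRUE", "FALSE") for value in present):
        return "boolean"
    try:
        for value in present:
            float(value)  # raises ValueError for non-numeric text
        return "number"
    except ValueError:
        return "text"


def describe_columns(text):
    """Report the inferred type and missing-value count for each column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        report[column] = {
            "type": infer_type(values),
            "missing": values.count(""),
        }
    return report


report = describe_columns(SAMPLE)
for column, summary in report.items():
    print(column, summary)
```

Outlier detection is left out of the sketch because what counts as an outlier depends on the data set and the method chosen.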

It is important to know the type of data you’re working with in data analytics in order to perform the correct analysis and get accurate results.

Here are some examples of how different types of data can be used in data analytics:

  • A retail company could use demographic data to understand the characteristics of its customers. This information could then be used to develop targeted marketing campaigns.
  • A technology company could use behavioral data to track how users interact with its products. This information could then be used to improve the user experience and develop new features.
  • A financial services company could use financial data to assess the risk of a loan applicant. This information could then be used to make lending decisions.
  • A manufacturing company could use operational data to track the performance of its production lines. This information could then be used to identify bottlenecks and improve efficiency.

By understanding the different types of data and how they can be used in data analytics, you can make better decisions about how to collect, clean, and analyze your data.

By now you’ve learned a lot about data. From generated data, to collected data, to data formats, it’s good to know as much as you can about the data you’ll use for analysis. In this video, we’ll talk about another way you can describe data: the data type. A data type is a specific kind of data attribute that tells what kind of value the data is. In other words, a data type tells you what kind of data you’re working with. Data types can be different depending on the query language you’re using. For example, SQL allows for different data types depending on which database you’re using. For now though, let’s focus on the data types that you’ll use in spreadsheets.

To help us out, we’ll use a spreadsheet that’s already filled with data. We’ll call it “Worldwide Interests in Sweets through Google Searches.” Now a data type in a spreadsheet can be one of three things: a number, a text or string, or a Boolean. You might find spreadsheet programs that classify them a bit differently or include other types, but these value types cover just about any data you’ll find in spreadsheets. We’ll look at all of these in just a bit.

Looking at columns B, D, and F, we find number data types. Each number represents the search interest for the terms “cupcakes,” “ice cream,” and “candy” for a specific week. The closer a number is to 100, the more popular that search term was during that week. One hundred represents peak popularity. Keep in mind that in this case, 100 is a relative value, not the actual number of searches. It represents the maximum number of searches during a certain time. Think of it like a percentage on a test. All other searches are then also valued out of 100. You might notice this in other data sets as well. Gold star for 100! If you needed to, you could change the numbers into percents or other formats, like currency. These are all examples of number data types.

In column H, the data shows the most popular treat for each week, based on the search data. So as we’ll find in cell H4 for the week beginning July 28th, 2019, the most popular treat was ice cream. This is an example of a text data type, or a string data type, which is a sequence of characters and punctuation that contains textual information. In this example, that information would be the treats and people’s names. These can also include numbers, like phone numbers or numbers in street addresses. But these numbers wouldn’t be used for calculations. In this case they’re treated like text, not numbers.

In columns C, E, and G, it seems like we’ve got some text. But the text here isn’t a text or string data type. Instead, it’s a Boolean data type. A Boolean data type is a data type with only two possible values: true or false. Columns C, E, and G show Boolean data for whether the search interest for each week is at least 50 out of 100. Here’s how it works. To get this data, we’ve created a formula that calculates whether the search interest data in columns B, D, and F is 50 or greater. In cell B4, the search interest is 14. In cell C4, we find the word false because, for this week of data, the search interest is less than 50. For each cell in columns C, E, and G, the only two possible values are true or false. We could change the formula so other words appear in these cells instead, but it’s still Boolean data. You’ll get a chance to read more about the Boolean data type soon.

Let’s talk about a common issue that people encounter in spreadsheets: mistaking data types for cell values. For example, in cell B57, we can create a formula to calculate data in other cells. This will give us the average of the search interests in cupcakes across all weeks in the dataset, which is about 15. The formula works because we calculated using a number data type. But if we tried it with a text or string data type, like the data in column C, we’d get an error. Error values usually happen if a mistake is made in entering the values in the cells. The more you know your data types and which ones to use, the fewer errors you’ll run into.

There you have it, a data type for everyone. We’re not done yet. Coming up, we’ll go deeper into the relationship between data types, fields, and values. See you soon.
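
The three spreadsheet data types described in the video can be mirrored in a few lines of Python. This is only a rough sketch, not the spreadsheet's actual formulas: the variable names, the `average` helper, and every value other than the 14 quoted for cell B4 are assumptions made up for illustration.

```python
# Hypothetical weekly search-interest scores on a 0-100 scale. Only the first
# value (14, cell B4 in the video) comes from the source; the rest are placeholders.
cupcake_interest = [14, 16, 15]

# Text (string) data: digits that are never used in calculations stay text.
most_popular_treat = "ice cream"
phone_number = "555-0100"  # made-up value for illustration

# Boolean data: True when interest is at least 50, like columns C, E, and G.
meets_threshold = [score >= 50 for score in cupcake_interest]


def average(values):
    """Return the arithmetic mean, or None if the values are not numbers."""
    try:
        return sum(values) / len(values)
    except TypeError:
        # Text cannot be added up, which mirrors the spreadsheet error.
        return None


number_average = average(cupcake_interest)
text_average = average(["ice cream", "candy", "cupcakes"])
```

Averaging the number column succeeds, while the same calculation on text values fails and the helper returns `None`, which is the kind of data-type mismatch the video warns about.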

Reading: Understanding Boolean logic

Video: Data table components

A data table, or tabular data, is arranged in rows and columns. The rows can be called “records” and the columns can be called “fields”. Each record has the same fields as the other records in the same order. Each field has a specific data type, such as text, number, or Boolean. Data tables are used in a variety of applications, such as music playlists, calendars, and email inboxes. As a data analyst, you will work with many different data tables and it is important to understand the structures of the tables you are working with.

Data table components in Data analytics

Data tables are a fundamental component of data analytics. They are used to organize and display data in a way that is easy to understand and analyze. Because every value sits in a labeled row and column, a data table can hold both quantitative and qualitative data in a structured form.

Data tables are typically composed of the following components:

  • Header row: The header row contains the names of the columns in the table.
  • Data rows: The data rows contain the actual data.
  • Footer row: The footer row can contain summary statistics for the data, such as the mean, median, and mode.

Data tables can also include other components, such as:

  • Row labels: Row labels can be used to identify the individual rows in the table.
  • Column labels: Column labels can be used to identify the individual columns in the table.
  • Data types: Data types can be used to specify the type of data that is stored in each column.
  • Filters: Filters can be used to display only a subset of the data in the table.
  • Sorts: Sorts can be used to arrange the data in the table in a specific order.

Data tables can be used to perform a variety of data analysis tasks, such as:

  • Descriptive statistics: Data tables can be used to calculate descriptive statistics for data, such as the mean, median, mode, and standard deviation.
  • Correlation analysis: Data tables can be used to calculate the correlation between two or more variables.
  • Regression analysis: Data tables can be used to perform regression analysis to identify the relationship between two or more variables.
  • Data visualization: Data tables can be used to create data visualizations, such as charts and graphs.

Data tables are a powerful tool for data analytics. By understanding the different components of data tables and how to use them, you can perform a variety of data analysis tasks and gain insights from your data.

Here are some tips for using data tables effectively in data analytics:

  • Use descriptive column labels. The column labels should clearly describe the data that is contained in each column.
  • Use consistent data types. All of the data in a column should be of the same data type. This will make it easier to perform data analysis tasks.
  • Use filters and sorts. Filters and sorts can be used to display only a subset of the data in the table and to arrange the data in a specific order. This can make it easier to identify patterns and trends in the data.
  • Use data visualization. Data visualizations can be used to communicate the findings of your data analysis in a clear and concise way.

By following these tips, you can use data tables effectively to perform data analytics and gain insights from your data.
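
As a rough illustration of the components and tips above, the sketch below builds a tiny table with a header row and data rows, computes footer-style summary statistics, and then filters and sorts the rows. The column names and all values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean, median

# Header row: descriptive column labels (hypothetical names).
header = ["product", "region", "units_sold"]

# Data rows: each column holds a single, consistent data type.
rows = [
    ["widget", "north", 120],
    ["gadget", "south", 80],
    ["widget", "south", 100],
    ["gadget", "north", 60],
]

region_idx = header.index("region")
units_idx = header.index("units_sold")

# Footer row: summary statistics for the numeric column.
units = [row[units_idx] for row in rows]
footer = {"mean": mean(units), "median": median(units)}

# Filter: keep only the subset of rows for one region.
north_rows = [row for row in rows if row[region_idx] == "north"]

# Sort: arrange rows from highest to lowest units sold.
sorted_rows = sorted(rows, key=lambda row: row[units_idx], reverse=True)
```

Looking columns up by header name rather than by position keeps the code readable if the column order ever changes.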

Here’s a riddle for you. What do a music playlist, a calendar agenda, and an email inbox have in common? I’ll give you a hint. It’s not a weekly jam session. The answer is they’re all arranged in tables. Go ahead and check out your email inbox or a favorite playlist, or look at your calendar agenda. There’s tables in every one!

A data table, or tabular data, has a very simple structure. It’s arranged in rows and columns. You can call the rows “records” and the columns “fields.” They basically mean the same thing, but records and fields can be used for any kind of data table, while rows and columns are usually reserved for spreadsheets. When talking about structured databases, people in data analytics usually go with “records” and “fields.” Sometimes a field can also refer to a single piece of data, like the value in a cell. In any case, you’ll hear both versions of these terms used throughout this program and your job.

Let’s go back to our playlist example. We’ll use the new terms we just introduced. So each song is a record. Each record has the same fields as the other records in the same order. In other words, the playlist has the same information about each song. Each song characteristic, like the title and the artist, is a field. Each separate field has the same data type, but different fields can have different types. Let me show you what I mean. For the song list, the song titles are a text or string type, while the song’s length could be a number type if you’re using it for calculations. Or it could be a date and time type. The column for favorites is Boolean since it has two possible values: favorite or not favorite.

We can view spreadsheets in the same way. The records in a spreadsheet might be about all sorts of things: clients, products, invoices, or anything else. Each record has several fields, which reveal more about the clients, products, or invoices. The value in every cell contains a specific piece of data, like the address of a client or the dollar amount of an invoice.

As a data analyst, lots of data will come your way, and records, fields, and values in data tables will help you navigate analysis. Understanding the structures of the tables you’re working with is a part of that. And hopefully, while you’re working hard on your analysis and those tables, you can have a little fun with a different data table: the one with your favorite playlist!
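
The playlist example maps naturally onto a small typed structure. The following sketch assumes a hypothetical `Song` record with four fields; the song titles, artists, and lengths are invented placeholders rather than data from the course.

```python
from dataclasses import dataclass, fields


@dataclass
class Song:
    """One record in the playlist; every record has the same fields in the same order."""
    title: str           # text/string field
    artist: str          # text/string field
    length_seconds: int  # number field, usable in calculations
    favorite: bool       # Boolean field: favorite or not favorite


# Each Song instance is a record; each attribute value is a single cell value.
playlist = [
    Song("Song A", "Artist X", 215, True),
    Song("Song B", "Artist Y", 187, False),
]

# The field names are shared by every record in the table.
field_names = [f.name for f in fields(Song)]

# Number fields support calculations; Boolean fields support filtering.
total_length = sum(song.length_seconds for song in playlist)
favorites = [song.title for song in playlist if song.favorite]
```

Storing the length as a number of seconds is one possible choice; as the video notes, a date and time type would also be reasonable depending on how the value is used.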

When discussing structured databases, data analysts refer to the data contained in a row as a record. How do they refer to the data contained in a column?

Field

Data analysts refer to the data contained in a column as a field.

Practice Quiz: Hands-On Activity: Applying a function

Video: Meet wide and long data

  • Wide data is a data format in which each row represents a single data subject, and each column represents a single attribute of that subject.
  • Long data is a data format in which each row represents a single data point, and each column represents a single variable.
  • Wide data makes it easier to identify and compare values across different columns.
  • Long data makes it easier to store and analyze multiple variables for each subject at each time point.
  • The best data format to use depends on the specific needs of the data analysis task.

Meet wide and long data in Data analytics

Wide and long data are two common data formats used in data analytics. Each format has its own advantages and disadvantages, and the best format to use depends on the specific needs of the analysis.

Wide data

Wide data is a data format in which each row represents a single data subject and each column represents a different attribute of that subject. For example, a wide dataset about the population of Latin and Caribbean countries might have one row for each country and one column for each year.

Long data

Long data is a data format in which each row represents a single data point in time for a single subject, and each column represents a different variable. For example, a long dataset about the population of Latin and Caribbean countries might have one row for each country-year pair, with columns for the country, year, population, and other variables such as the average age of the population.
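
To make the two layouts concrete, here is a small sketch that stores the same information in both formats. The country names and population figures are placeholders, not real statistics, and the variable names are assumptions made for this example.

```python
# Wide format: one row per subject (country), one column per year.
# All figures are made-up placeholder values.
wide_header = ["country", "2010", "2011", "2012"]
wide_rows = [
    ["Country A", 100, 110, 120],
    ["Country B", 200, 210, 220],
]

# Long format: one row per country-year pair, one column per variable.
long_header = ["country", "year", "population"]
long_rows = [
    ["Country A", 2010, 100],
    ["Country A", 2011, 110],
    ["Country A", 2012, 120],
    ["Country B", 2010, 200],
    ["Country B", 2011, 210],
    ["Country B", 2012, 220],
]

# Adding a new variable (for example, average age) costs one extra column
# in the long layout, but one extra column per year in the wide layout.
year_count = len(wide_header) - 1
extra_columns_long = 1
extra_columns_wide = year_count
```

With three years of data the wide table would need three new columns for the extra variable, while the long table needs only one.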

Advantages and disadvantages of wide and long data

Wide data is good for identifying and comparing different columns. For example, in the wide dataset about the population of Latin and Caribbean countries, it is easy to compare the annual populations of different countries or the populations of the same country at different points in time.

Long data is good for storing and organizing data when there are multiple variables for each subject at each time point. For example, in the long dataset about the population of Latin and Caribbean countries, it is easy to store and analyze data on the population, average age, and other variables for each country-year pair.

Transforming between wide and long data

It is often necessary to transform wide data into long data, or vice versa, depending on the needs of the analysis. For example, if you want to analyze the relationship between the population and average age of Latin and Caribbean countries over time, you would need to transform the wide data into long data.

There are a variety of tools and resources available to help you transform wide and long data. For example, many spreadsheet programs have built-in functions for transforming data formats.
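
As one possible illustration of such a transformation, the sketch below converts a small wide table to long format and back using plain Python. The function names `wide_to_long` and `long_to_wide`, the column names, and the placeholder figures are all assumptions made for this example; in practice a spreadsheet feature or a library such as pandas (its `melt` and `pivot` functions) would usually do this work.

```python
def wide_to_long(header, rows, value_name="population"):
    """Turn one row per subject into one row per (subject, year) pair."""
    id_name, *years = header
    long_rows = []
    for subject, *values in rows:
        for year, value in zip(years, values):
            long_rows.append([subject, int(year), value])
    return [id_name, "year", value_name], long_rows


def long_to_wide(long_rows, id_name="country"):
    """Turn one row per (subject, year) pair back into one row per subject."""
    years = sorted({year for _, year, _ in long_rows})
    by_subject = {}
    for subject, year, value in long_rows:
        by_subject.setdefault(subject, {})[year] = value
    header = [id_name] + [str(year) for year in years]
    rows = [
        [subject] + [values.get(year) for year in years]
        for subject, values in by_subject.items()
    ]
    return header, rows


# Placeholder figures, not real population statistics.
wide_header = ["country", "2010", "2011"]
wide_rows = [["Country A", 100, 110], ["Country B", 200, 210]]

long_header, long_rows = wide_to_long(wide_header, wide_rows)
round_trip_header, round_trip_rows = long_to_wide(long_rows)
```

Converting to long and back again returns the original wide table, which is a quick way to check that no values were lost along the way.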

When to use wide and long data

The best data format to use depends on the specific needs of the analysis. If you need to identify and compare different columns, then wide data is a good choice. If you need to store and organize data when there are multiple variables for each subject at each time point, then long data is a good choice.

Here are some examples of when to use wide and long data:

  • Wide data:
    • Comparing the sales of different products in different regions
    • Analyzing the performance of different marketing campaigns
    • Identifying the characteristics of different customer segments
  • Long data:
    • Tracking the customer journey over time
    • Analyzing the impact of different interventions on patient outcomes
    • Forecasting future sales or demand

By understanding the advantages and disadvantages of wide and long data, you can choose the right format for your analysis.

You probably use the words “wide” and “long” all the time. You might use “wide” to describe the size of something from side to side, like a wide river. But a river can also travel great distances, so you might call it “long” as well. Wait! Before you stop the video, I promise you didn’t accidentally click in the wrong course. I’m not here to teach you words you already know. But the words “wide” and “long” can be used to describe data, too. So I am here to help you understand wide data and long data.

So far you’ve dealt with data arranged mostly in a wide format. With wide data, every data subject has a single row with multiple columns to hold the values of various attributes of the subject. Here’s some wide data in a spreadsheet. You might remember we discussed this data about the population of Latin and Caribbean countries earlier. For this data set, each row provides all of the population information about one country. Each column shows the population for a different year.

Wide data lets you easily identify and quickly compare different columns. In our example, the data is arranged alphabetically by country, so you can compare the annual populations of Antigua and Barbuda, Aruba, and the Bahamas by just checking out the values in each column. The wide data format also makes it easy to find and compare the countries’ populations at different periods of time. For example, by sorting the data, we discover that Brazil had the highest population of all countries in 2010, and the British Virgin Islands had the lowest population of all countries in 2013.

Okay, now let’s explore this data in a long format. Here the data is no longer organized into columns by year. All the years are now in one column with each country, like Argentina, appearing in multiple rows, one for each year of data. This is how long data usually looks. Long data is data in which each row is one time point per subject, so each subject will have data in multiple rows. Our spreadsheet is formatted to show each year of population data. Here we see Antigua and Barbuda first.

Long data is a great format for storing and organizing data when there’s multiple variables for each subject at each time point that we want to observe. With this long data format, we can store and analyze all of this data using fewer columns. Plus, if we added a new variable, like the average age of a population, we’d only need one more column. If we’d used a wide data format instead, we would have needed 10 more columns, one for each year. The long data format keeps everything nice and compact.

If you’re wondering which format you should use, the simple answer is, “it depends.” Sometimes you’ll have to transform wide data into a long data format, or other times vice versa. You’ll probably work with both formats in your job. And you’ll definitely revisit both formats again later in this program.

That reminds me: earlier we defined data as a collection of facts. As you’ve discovered over the last few videos, that collection of facts can take on lots of different formats, structures, types, and more. Learning about all of the ways that data can be presented will be a big help to you throughout the data analysis process. The more you work with data in all its forms, the quicker you’ll start to recognize which data to use, and when to use it.

And in just a bit, you’ll use all that data stored in your brain to help you take an assessment. After that, you’ll learn how to identify and avoid bias in data and how to embrace credibility, integrity and ethics. The data adventure moves forward. I’m so glad you’re moving with it!

Reading: Transforming data

Practice Quiz: Hands-on Activity: Introduction to Kaggle

Practice Quiz: Test your knowledge on data types, fields, and values

Fill in the blank: Internet search engines are an everyday example of how Boolean operators are used. The Boolean operator _____ expands the number of results when used in a keyword search.

Which of the following statements accurately describes a key difference between wide and long data?

What does data transformation enable data analysts to accomplish?

Weekly challenge 1


Reading: Glossary: Terms and definitions

Quiz: *Weekly challenge 1*

What is the most likely reason that a data analyst would use historical data instead of gathering new data?

Which of the following are examples of discrete data? Select all that apply.

Which of the following questions collect nominal qualitative data? Select all that apply.

Which of the following is a benefit of internal data?

A data analyst is reviewing data that has been organized into a table format. What type of data is in the table?

A Boolean data type must have a numeric value.

In wide data, what do the columns contain?

What is the term for changing the structure or format of data?