We all generate lots of data in our daily lives. In this part of the course, you’ll check out how we generate data and how analysts decide which data to collect for analysis. You’ll also learn about structured and unstructured data, data types, and data formats as you start thinking about how to prepare your data for exploration.
Learning Objectives
- Explain how data is generated as a part of our daily activities with reference to the types of data generated
- Explain factors that should be considered when making decisions about data collection
- Explain the difference between structured and unstructured data
- Discuss the difference between data and data types
- Explain the relationship between data types, fields, and values
- Discuss wide and long data formats with reference to organization and purpose
- Data exploration
- Collecting data
- Differentiate between data formats and structures
- Explore data types, fields, and values
- Video: Know the type of data you're working with
- Reading: Understanding Boolean logic
  - The AND operator
  - The OR operator
  - The NOT operator
- Video: Data table components
- Practice Quiz: Hands-On Activity: Applying a function
- Video: Meet wide and long data
- Reading: Transforming data
- Practice Quiz: Hands-on Activity: Introduction to Kaggle
- Practice Quiz: Test your knowledge on data types, fields, and values
- Reading: Glossary: Terms and definitions
- Quiz: *Weekly challenge 1*
Data exploration
Video: Introduction to data exploration
This course will teach you how to prepare data for analysis, a crucial step in the data analysis process. You will learn to identify how data is generated and collected, and you will explore different formats, types, and structures of data. You will also learn how to choose and use data that will help you understand and respond to a business problem, and how to analyze data for bias and credibility. Additionally, you will learn what clean data means and how to extract data from a database using spreadsheets and SQL. Finally, you will learn the basics of data organization and data protection.
Here is a more detailed overview of each topic:
- Identifying how data is generated and collected: You will learn about the different ways that data is generated and collected, such as through surveys, sensors, and social media. You will also learn about the different types of data, such as structured, unstructured, and semi-structured data.
- Exploring different formats, types, and structures of data: You will learn about the different formats that data can be stored in, such as CSV, JSON, and XML. You will also learn about the different types of data, such as numerical, categorical, and text data. Finally, you will learn about the different structures that data can have, such as tables, hierarchies, and graphs.
- Choosing and using data to understand and respond to a business problem: You will learn how to choose the right data for your analysis based on the business problem that you are trying to solve. You will also learn how to use data to answer your questions and to make recommendations.
- Analyzing data for bias and credibility: You will learn how to identify and mitigate bias in data. You will also learn how to assess the credibility of data.
- Understanding clean data: You will learn what clean data is and why it is important. You will also learn how to clean data.
- Extracting data from a database using spreadsheets and SQL: You will learn how to use spreadsheets and SQL to extract data from a database.
- Learning the basics of data organization and data protection: You will learn how to organize data in a way that is efficient and easy to use. You will also learn how to protect data from unauthorized access and corruption.
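As a quick illustration of the formats mentioned above, the same record can be written out as CSV and as JSON using Python's standard library (the record itself is invented for the example):

```python
import csv
import io
import json

# A hypothetical record about a single tree
record = {"name": "Maple", "height_ft": 42, "species": "Acer"}

# CSV: flat and tabular — a header row, then one line per record
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())

# JSON: nested key-value structure that preserves data types
print(json.dumps(record))
```

Note that CSV stores everything as text (42 comes back as the string "42" when re-read), while JSON preserves the distinction between numbers and strings, which is one reason format choice matters when preparing data.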
By the end of this course, you will have the skills necessary to prepare data for analysis in a variety of settings. You will be able to choose and use the right data for your analysis, clean the data, and extract data from a database. You will also be able to organize and protect your data.
Introduction to Data Exploration
Data exploration is the process of analyzing data to discover patterns, trends, and insights. It is an essential step in any data science project, as it allows you to understand your data and identify the most important questions to ask.
There are a number of different techniques that can be used for data exploration, but some of the most common include:
- Visualizations: Visualizations are a great way to get a quick overview of your data and to identify any obvious patterns or trends. Some common visualizations include histograms, bar charts, line charts, and scatter plots.
- Summary statistics: Summary statistics can be used to get a more detailed understanding of your data. Some common summary statistics include the mean, median, mode, and standard deviation.
- Grouping and aggregation: Grouping and aggregation can be used to identify patterns in your data that are not immediately obvious. For example, you could group your data by customer age and then calculate the average purchase amount for each age group.
- Anomalies: Anomalies are data points that are significantly different from the rest of the data. Identifying anomalies can help you detect fraud or other problems.
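The summary-statistics and grouping-and-aggregation techniques above can be sketched in a few lines of Python. The purchase data here is made up for illustration; a library like pandas would do the same with `groupby`, but the standard library is enough to show the idea:

```python
from statistics import mean, median, stdev

# Hypothetical purchase records: (customer_age_group, purchase_amount)
purchases = [
    ("18-29", 35.0), ("18-29", 20.0), ("30-44", 55.0),
    ("30-44", 65.0), ("45-64", 40.0), ("45-64", 80.0),
]

# Summary statistics over all purchase amounts
amounts = [amt for _, amt in purchases]
print("mean:", round(mean(amounts), 2))
print("median:", median(amounts))
print("stdev:", round(stdev(amounts), 2))

# Group by age group, then aggregate (average purchase per group)
by_group = {}
for group, amt in purchases:
    by_group.setdefault(group, []).append(amt)
averages = {g: mean(v) for g, v in by_group.items()}
print(averages)  # {'18-29': 27.5, '30-44': 60.0, '45-64': 60.0}
```

The grouped result surfaces a pattern (spending differs by age group) that the overall mean alone would hide, which is exactly why grouping and aggregation are listed as exploration techniques.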
Here is a step-by-step guide to data exploration:
- Gather your data: The first step is to gather the data that you want to explore. This data could come from a variety of sources, such as a database, a spreadsheet, or a log file.
- Clean your data: Once you have your data, you need to clean it. This means removing any errors or inconsistencies in the data.
- Explore your data: Once your data is clean, you can start to explore it. This can be done using a variety of techniques, such as visualizations, summary statistics, grouping and aggregation, and anomaly detection.
- Identify patterns and trends: As you explore your data, look for patterns and trends. These patterns and trends can help you to understand your data and to identify the most important questions to ask.
- Formulate hypotheses: Once you have identified some patterns and trends, you can start to formulate hypotheses. Hypotheses are statements about your data that you can test using further analysis.
Data exploration is an iterative process. As you learn more about your data, you may need to go back and explore it again using different techniques.
Here are some tips for data exploration:
- Start with a clear goal: What do you want to learn from your data? Once you know your goal, you can focus your exploration on the most relevant data and techniques.
- Be creative: There are many different ways to explore data. Don’t be afraid to experiment with different techniques and visualizations.
- Use a variety of tools: There are a number of different tools available for data exploration. Use the tools that are most comfortable for you and that are best suited for your data.
- Collaborate with others: Data exploration can be a challenging task. Don’t be afraid to collaborate with others, such as data scientists or analysts.
Data exploration is an essential skill for any data scientist or data analyst. By following the tips above, you can learn how to explore your data effectively and to discover valuable insights.
Picture this: You’re working on a project. You’ve asked all the right questions, applied structured thinking, and you’re completely in sync with your stakeholders. You’re off to a great start. But there’s another step in the process: preparing the data correctly. This is where understanding the different types of data and data structures comes in. Knowing this lets you figure out what type of data is right for the question you’re answering. Plus, you’ll gain practical skills about how to extract, use, organize, and protect your data.
Hey, my name is Hallie, and I’m an analytical lead at Google. I work with companies in the healthcare industry. I’m so excited to welcome you to this course. You’ve been building up your data analyst skills in lots of different ways so far. You’ve learned how to ask the right questions, define the problem, and present your analysis in a way that matches up with the needs of your stakeholders. In other words, you’ve learned how to tell a story using data. Now we’ll learn more about the data that you’ll need to tell the best story possible. But before we do that, I’d love to tell you my story.
I use analytics to help healthcare companies develop digital marketing solutions that make their business and their brands stronger. My team and I find business and media opportunities based on the latest industry and data insights. I’ve been working in healthcare for about five years, and it’s great. I really enjoy being able to use data to help spark change in such an important industry. As you’ll discover in this course, data can be the main character in a very powerful story. I absolutely love using analysis to tell that story in a way that’s compelling and informative.
Here’s a real-life example of how I’ve used data to tell a story. In my job, we analyze Medicare enrollment data over time and make connections to how people research Medicare plans on Google. As people 65 and older become more informed decision makers for their health, I use the data to learn if there’s an increase in Medicare enrollments and what part Google searches play if there is an increase in demand. Now it’s very important that I make sure the data is relevant and valid. I also have to pay attention to questions around access and equity while maintaining the privacy of those conducting searches. The happy ending of my story is that the data in my findings is useful to medical professionals and their patients.
There’s so much useful data out there, and you’re building the skills you’ll need to find and use the right data in the best way. In this course, you’ll continue sharpening those skills. So you’ve already heard a lot about the data analysis process steps: Ask, Prepare, Process, Analyze, Share, and Act. Now it’s time to learn how to prepare the data. You’ll learn to identify how data is generated and collected, and you’ll explore different formats, types, and structures of data. We’ll make sure you know how to choose and use data that’ll help you understand and respond to a business problem. And because not all data fits each need, you’ll learn how to analyze data for bias and credibility. We’ll also explore what clean data means.
But wait, there’s more. You’ll also get up close and personal with databases. We’ll cover what they are and how analysts use them. You’ll even get to extract your own data from a database using a couple of tools that you’re already familiar with: spreadsheets and SQL. The key here is patience. Like anything worth doing, this will take time and practice. And I’ll be with you every step of the way.
Still with me? Great. The last few things we’ll cover are the basics of data organization and the process of protecting your data. Data works best when it’s organized. And if you’re organizing your data, you’ll want to protect it too. I’ll show you how to do both and apply it to your own analysis. I’m so excited to help you write your own personal story as you continue exploring the world of data analytics. So let’s do it.
Reading: Course syllabus
Overview
- Foundations: Data, Data, Everywhere
- Ask Questions to Make Data-Driven Decisions
- Prepare Data for Exploration (this course)
- Process Data from Dirty to Clean
- Analyze Data to Answer Questions
- Share Data Through the Art of Visualization
- Data Analysis with R Programming
- Google Data Analytics Capstone: Complete a Case Study
Welcome to the third course in the Google Data Analytics Certificate! So far, you have been introduced to the field of data analytics and discovered how data analysts can use their skills to answer business questions.
As a data analyst, you need to be an expert at structuring, extracting, and making sure the data you are working with is reliable. To do this, it is always best to develop a general idea of how all data is generated and collected, since every organization structures data differently. Then, no matter what data structure you are faced with in your new role, you will feel confident working with it.
You will soon discover that when data is extracted, it isn’t perfect. It might be biased instead of credible, or dirty instead of clean. Your goal is to learn how to analyze data for bias and credibility and to understand what clean data means. You will also get up close and personal with databases and even get to extract your own data from a database using spreadsheets and SQL. The last topics covered are the basics of data organization and the process of protecting your data.
And you will learn how to identify different types of data that can be used to understand and respond to a business problem. In this part of the program, you will explore different types of data and data structures. And best of all, you will keep adding to your data analyst tool box! From extracting and using data, to organizing and protecting it, these key skills will come in handy no matter what you are doing in your career as a data analyst.
Course content
Course 3 – Prepare Data for Exploration
- Understanding data types and structures: We all generate lots of data in our daily lives. In this part of the course, you will check out how we generate data and how analysts decide which data to collect for analysis. You’ll also learn about structured and unstructured data, data types, and data formats as you start thinking about how to prepare your data for exploration.
- Understanding bias, credibility, privacy, ethics, and access: When data analysts work with data, they always check that the data is unbiased and credible. In this part of the course, you will learn how to identify different types of bias in data and how to ensure credibility in your data. You will also explore open data and the relationship between and importance of data ethics and data privacy.
- Databases: Where data lives: When you are analyzing data, you will access much of the data from a database. It’s where data lives. In this part of the course, you will learn all about databases, including how to access them and extract, filter, and sort the data they contain. You will also check out metadata to discover the different types and how analysts use them.
- Organizing and protecting your data: Good organization skills are a big part of most types of work, and data analytics is no different. In this part of the course, you will learn the best practices for organizing data and keeping it secure. You will also learn how analysts use file naming conventions to help them keep their work organized.
- Engaging in the data community (optional): Having a strong online presence can be a big help for job seekers of all kinds. In this part of the course, you will explore how to manage your online presence. You will also discover the benefits of networking with other data analytics professionals.
- Completing the Course Challenge: At the end of this course, you will be able to apply what you have learned in the Course Challenge. The Course Challenge will ask you questions about the key concepts and then will give you an opportunity to put them into practice as you go through two scenarios.
What to expect
This part of the program is designed to get you familiar with different data structures and show you how to collect, apply, organize, and protect data. All of these skills will be part of your daily tasks as an entry-level data analyst. You will work on a wide range of activities that are similar to real-life tasks that data analysts come across on a daily basis.
This course has five modules or weeks, and each has several lessons included. Within each lesson, you will find content such as:
- Videos of instructors teaching new concepts and demonstrating the use of tools
- In-video questions that pop up during or at the end of a video to check your learning
- Readings to introduce new ideas and build on the concepts from the videos
- Discussion forums to discuss, explore, and reinforce new ideas for better learning
- Discussion prompts to promote thinking and engagement in the discussion forums
- Hands-on activities to introduce real-world, on-the-job situations, and the tools and tasks to complete assignments
- Practice quizzes to prepare you for graded quizzes
- Graded quizzes to measure your progress and give you valuable feedback
Hands-on activities provide additional opportunities to build your skills. Try to get as much out of them as possible. Assessments reflect the course’s approach of offering a wide variety of learning materials and activities that reinforce important skills. Graded and ungraded quizzes will help the content sink in. Ungraded practice quizzes are a chance for you to prepare for the graded quizzes. Both types of quizzes can be taken more than once.
As a quick reminder, this course is designed for all types of learners, with no degree or prior experience required. Everyone learns differently, so the Google Data Analytics Certificate has been designed with that in mind. Personalized deadlines are just a guide, so feel free to work at your own pace. There is no penalty for late assignments. If you prefer, you can extend your deadlines by returning to Overview in the navigation pane and clicking Switch Sessions. If you already missed previous deadlines, click Reset my deadlines instead.
If you would like to review previous content or get a sneak peek of upcoming content, you can use the navigation links at the top of this page to go to another course in the program. When you pass all required assignments, you will be on track to earn your certificate.
Optional speed track for those experienced in data analytics
The Google Data Analytics Certificate provides instruction and feedback for learners hoping to earn a position as an entry-level data analyst. While many learners will be brand new to the world of data analytics, others may be familiar with the field and may simply want to brush up on certain skills.
If you believe this course will be primarily a refresher for you, we recommend taking the practice diagnostic quiz offered this week. It will enable you to determine if you should follow the speed track, which is an opportunity to proceed to Course 4 after having taken each of the Course 3 Weekly Challenges and the overall Course Challenge. Learners who earn 100% on the diagnostic quiz can treat Course 3 videos, readings, and activities as optional. Learners following the speed track are still able to earn the certificate.
Tips
- Do your best to complete all items in order. All new information builds on earlier learning.
- Treat every task as if it is real-world experience. Have a mindset that you are working at a company or in an organization as a data analyst. This will help you apply what you learn in this program to the real world.
- Even though they aren’t graded, it is important to complete all practice items. They will help you build a strong foundation as a data analyst and better prepare you for the graded assessments.
- Take advantage of all additional resources provided.
- When you encounter useful links in the course, remember to bookmark them so you can refer to the information later for study or review.
Video: Hallie: Fascinating data insights
Hallie is an analytical lead at Google who is passionate about using data to help the healthcare industry. She started her career analyzing large sums of patient data, and she has since expanded her skills to include user research and understanding how users search on Google.
Hallie believes that the most important skill in data analysis is creativity. She enjoys piecing together nuggets of information to create a narrative using data. This skillset is essential for identifying trends and patterns in data, and for communicating those findings to others in a clear and concise way.
In her role at Google, Hallie works with healthcare companies to help them understand the industry and to better market to their target audiences. She also conducts research on the healthcare industry to identify new trends and opportunities.
Hallie’s work has a direct impact on the healthcare industry. By helping healthcare companies to use data more effectively, she is helping them to improve their marketing campaigns, to develop new products and services, and to provide better care to their patients.
Here are some specific examples of how Hallie’s work has helped the healthcare industry:
- She developed a new data-driven approach to marketing that has helped healthcare companies to increase their brand awareness and to reach more potential customers.
- She created a report on the latest trends in healthcare search that has helped healthcare companies to better understand the needs of their patients and to develop more targeted marketing campaigns.
- She worked with a team of engineers to develop a new tool that helps healthcare providers to identify patients who are at risk of developing certain diseases. This tool has helped healthcare providers to intervene early and to prevent patients from developing serious health problems.
Hallie is a valuable asset to the healthcare industry. She is a skilled data analyst and a creative thinker. She is also passionate about using data to make a positive impact on the world.
Healthcare is just a really fascinating place in the US. It’s a really incredible industry to work in because it is so historically traditional, and healthcare companies, unlike other tech companies, just really have not used data to inform decisions. When I was in college, I had a professor who didn’t want us to have textbooks because he just said the healthcare industry was changing so rapidly, and it wouldn’t make sense to have a textbook, which is just a static piece of text when things were just really evolving. So I would say healthcare and data, and the two together, is a newer concept: using big data, using machine learning and artificial intelligence to help the healthcare industries.
I started analyzing large sums of patient data. That was the first time I had really worked with such huge datasets, and I found it really fascinating that we can take all of these datasets and synthesize them and really deliver some cool insights and trends to our hospital systems. That was the first time I started thinking about data analysis, data analytics, as a possible career for me. That’s really what brought me to this analytical lead role at Google, where I could take that knowledge and that skill set of analyzing datasets and do that on a daily basis, so that really, every conversation I was having with the client was a data-informed conversation.
I work within the healthcare vertical. We have companies who market on our platforms, like Google Search and YouTube. We help them understand the healthcare industry so that they can better market to the audience that they’re trying to reach. Whether you’re a healthcare insurer or you’re a healthcare provider, maybe a hospital system, they all have different needs on how they want to reach their audience using Google’s platforms. We help them optimize their marketing spend, but we also do a lot of research in the healthcare industry. Some user research, some understanding of how users are really just searching on Google, to give them a sense of what’s really happening in the industry and how they can market effectively.
I would say that my technical skills with data analytics came with time. The most important skill I found, which has also come with time and grown with me, is just the creativity side of data analysis. I mean, you can really learn a lot of the SQL skills and R, and I know some of that is within the course. But really, the creativity side is something that just comes with experience. When you’re looking at a dataset, you might look at it one way and analyze it one way, and then have someone else look at it, or look at it a week later, and then all of a sudden the trend that you’re seeing is completely different. You have to take a lot of these pieces of information, these nuggets, I like to call them, and just piece together a really nice narrative using data. That skill set is something I learned when I was working in consulting, and I’ve taken that to Google and really been able to polish a lot of those skills and some of the more technical skills. Technical and the creative side are what I’ve grown to love. My name is Hallie. I’m an analytical lead at Google working specifically in the healthcare vertical.
Reading: Deciding if you should take the speed track
Overview
This reading provides an overview of a speed track we offer to those familiar with data analytics.
If you are brand new to data analytics, you can skip the diagnostic quiz after this reading, and move directly to the next activity: Data collection in our world.
The Google Data Analytics Certificate is a program for anyone. A background in data analysis isn’t required. But you might be someone who has some experience already. If you are this type of learner, we have designed a speed track for this course. Learners who opt for the speed track can refresh on the basic topics and take each of the weekly challenges and the Course Challenge at a faster pace.
To help you decide if you’re a good match for the speed track for this course:
- Take the optional diagnostic quiz.
- Refer to the scoring guide to determine if you’re a good fit for the speed track. A score of 90% or higher is the target goal for someone on the speed track.
- Based on your individual score, follow the recommendations in the scoring guide for your next steps.
Important reminder: If you’re eligible for the speed track, you’re still responsible for completing all graded activities. In order to earn your certificate, you will need an overall score of 80% or higher on all graded materials in the program.
Practice Quiz: Optional: Familiar with data analytics? Take our diagnostic quiz
A data analyst at a construction company is working on a report for a quickly approaching deadline. Why might they choose to analyze only historical data?
The project has a very short time frame.
They would analyze only historical data because the project has a very short time frame.
What are the benefits of data modeling? Select all that apply.
Keep data consistent
Make data easier to understand
Provide a map of how data is organized
Data modeling keeps data consistent, provides a map of how data is organized, and makes data easier to understand. Data modeling is the process of creating a model that is used for organizing data elements and how they relate to one another.
A group of high school students takes a survey that asks, “Are you on an athletic team? Please reply yes or no.” What kind of data is being collected?
Boolean
Boolean data would be collected. Boolean data has only two possible values, such as yes or no.
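Boolean data like this maps directly onto a programming language’s two-valued type. A quick sketch in Python (the survey responses are made up for illustration):

```python
# Survey responses recorded as Boolean values: True = yes, False = no
on_athletic_team = [True, False, True, True, False]

# Because Boolean data has only two possible values, counting is simple:
# True counts as 1 when summed
yes_count = sum(on_athletic_team)
no_count = len(on_athletic_team) - yes_count
print(yes_count, no_count)  # 3 2
```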
A data analyst is evaluating data to determine whether it is good or bad. Which qualities characterize good data? Select all that apply.
Cited
Current
Comprehensive
Good data is comprehensive, current, and cited.
Imagine that a company uses your personal data as part of a financial transaction. Before it occurs, you are not made aware of the nature and scale of this transaction. What concept of data ethics does this violate?
Transaction transparency
Using someone’s personal data in a financial transaction without making them aware of its nature and scale violates transaction transparency, the principle that all data-processing activities and algorithms should be explainable to and understood by the individual who provides the data.
Which of the following are protections afforded by data privacy? Select all that apply.
Preserving a data subject’s information and activity for all data transactions
Providing users the right to inspect, update, or correct their own data
The protections of data privacy include preserving a data subject’s information and activity for all data transactions. They also include providing users the right to inspect, update, and correct their own data.
Which of the following are uses of relational databases? Select all that apply.
Present the same information to each collaborator
Keep data consistent regardless of where it’s accessed
Contain and describe a series of tables that can be connected to form relationships
Relational databases are used to contain and describe a series of tables that can be connected to form relationships. They also present the same information to each collaborator by keeping data consistent regardless of where it’s accessed.
Which statements define primary keys and foreign keys and describe their relationship? Select all that apply.
A primary key is an identifier that references a column in which each value is unique.
A foreign key is a field within a table that’s a primary key in another table.
Primary and foreign keys are two connected identifiers within separate tables in a relational database.
A primary key is an identifier that references a column in which each value is unique. A foreign key is a field within a table that’s a primary key in another table. Primary and foreign keys are two connected identifiers within separate tables in a relational database.
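The primary-key/foreign-key relationship described above can be demonstrated with SQLite through Python’s built-in sqlite3 module. The table and column names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# customer_id is the primary key of customers: each value is unique
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    )""")

# customer_id appears in orders as a foreign key, connecting the two tables
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        amount      REAL
    )""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")  # valid: customer 1 exists

# Joining on the key pair forms the relationship between the tables
row = conn.execute("""
    SELECT c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
""").fetchone()
print(row)  # ('Ada', 25.0)
```

With foreign-key enforcement on, inserting an order that references a nonexistent customer would raise an integrity error, which is how the database keeps the two tables consistent.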
What tasks can data analysts accomplish using metadata? Select all that apply.
Interpret the contents of a database
Evaluate the quality of data
Combine data from more than one source
Data analysts use metadata to combine data, evaluate data, and interpret a database. Metadata is data about data; in database management, it helps data analysts understand the contents of the data within a database.
A data analyst reviews a spreadsheet of boat auction sales to find the last five sailboats sold in Kentucky. What steps would they take in order to narrow the scope? Select all that apply.
Filter out sales outside of Kentucky
Sort by date in descending order
The analyst can filter out sales outside of Kentucky and sort by date in descending order.
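The same two narrowing steps, filter and then sort by date in descending order, look like this in Python (the auction records are invented for the example):

```python
# Hypothetical boat auction sales records
sales = [
    {"boat": "sailboat", "state": "KY", "date": "2023-05-01"},
    {"boat": "sailboat", "state": "TN", "date": "2023-06-10"},
    {"boat": "sailboat", "state": "KY", "date": "2023-07-15"},
    {"boat": "sailboat", "state": "KY", "date": "2023-04-20"},
]

# Step 1: filter out sales outside of Kentucky
kentucky = [s for s in sales if s["state"] == "KY"]

# Step 2: sort by date in descending order
# (ISO yyyy-mm-dd dates sort correctly as strings)
kentucky.sort(key=lambda s: s["date"], reverse=True)

# The most recent sales now come first
print([s["date"] for s in kentucky])  # ['2023-07-15', '2023-05-01', '2023-04-20']
```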
You are writing a SQL query to filter data from a database that describes trees in Omaha, Nebraska. You want to only display entries for trees that have a diameter of 30 inches. The name of the table you’re using is Nebraska_trees and the name of the column that shows the diameters of the trees is trunk_diameter. What is the correct query syntax that will retrieve and filter data from this table?
SELECT * FROM Nebraska_trees WHERE trunk_diameter = 30
The correct query is SELECT * FROM Nebraska_trees WHERE trunk_diameter = 30.
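To see the filter in action, the query can be run against a small sample table using Python’s built-in sqlite3 module. Only the table and column names come from the question; the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Nebraska_trees (tree_id INTEGER, trunk_diameter INTEGER)")
conn.executemany(
    "INSERT INTO Nebraska_trees VALUES (?, ?)",
    [(1, 12), (2, 30), (3, 45), (4, 30)],
)

# SELECT * returns every column; WHERE keeps only rows matching the condition
rows = conn.execute(
    "SELECT * FROM Nebraska_trees WHERE trunk_diameter = 30"
).fetchall()
print(rows)  # [(2, 30), (4, 30)]
```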
Consistent naming conventions describe which properties of a file? Select all that apply.
Content
Creation date
Version
Consistent naming conventions describe the content, creation date, and version of a file.
Collecting data
Video: Data collection in our world
Data is being generated all around the world at an unprecedented rate, and there are many different ways that it can be generated and collected. Some common methods include:
- Online activity: Every time we search the web, watch a video, or post on social media, we are generating data. This data can be collected by companies and organizations to learn more about our interests and habits.
- Surveys and questionnaires: Surveys and questionnaires are a common way to collect data from people directly. They can be used to learn about people’s opinions, experiences, and behaviors.
- Interviews: Interviews can be used to collect data from people in more depth. They can be used to get more personal and detailed information than surveys or questionnaires.
- Scientific observations: Scientists collect data by observing the world around them. This can include observing animal behavior, studying bacteria under a microscope, or conducting experiments.
- Forms: Forms are often used to collect data from people in person or online. They are commonly used by businesses and government agencies to collect information about customers, employees, and citizens.
Data can be used for a variety of purposes, including:
- To improve products and services: Companies can use data to learn more about what their customers want and need. This information can then be used to improve products and services.
- To make better decisions: Data can be used to make informed decisions about everything from business strategy to public policy.
- To conduct research: Scientists and other researchers use data to learn more about the world around us and to solve problems.
It is important to note that data collection and generation should be done in an ethical and responsible manner. This means respecting people’s privacy and ensuring that data is used for its intended purpose.
Data analytics is the process of collecting, cleaning, analyzing, and interpreting data to gain insights and make better decisions. Data collection is the first step in the data analytics process, and it is important to collect the right data in order to get meaningful results.
There are many different ways to collect data, and the best method will depend on the specific needs of the data analysis project. Some common data collection methods include:
- Online surveys: Online surveys are a convenient and efficient way to collect data from a large number of people. They can be used to collect data on a variety of topics, such as demographics, opinions, and behaviors.
- Mobile surveys: Mobile surveys are similar to online surveys, but they are designed to be taken on mobile devices. This makes them a good option for collecting data from people who are on the go.
- In-person interviews: In-person interviews allow researchers to collect more detailed and nuanced data than surveys. They can also be used to build relationships with participants and to get a better understanding of their perspectives.
- Focus groups: Focus groups are small groups of people who are brought together to discuss a particular topic. They can be used to generate new ideas, to explore different perspectives, and to get feedback on products or services.
- Observation: Observation is a data collection method that involves watching and recording people’s behavior. It can be used to collect data on a variety of topics, such as how people use products, how they interact with each other, and how they respond to different stimuli.
- Sensor data: Sensors can be used to collect data on a variety of environmental factors, such as temperature, humidity, and air quality. Sensor data can be collected in real time and can be used to track changes over time.
Once the data has been collected, it needs to be cleaned and analyzed. Data cleaning involves removing any errors or inconsistencies from the data. Data analysis involves using statistical and visualization tools to identify patterns and trends in the data.
The results of the data analysis can then be used to make informed decisions about a variety of topics, such as how to improve products and services, how to allocate resources, and how to develop new policies.
Here are some tips for collecting data effectively:
- Define your goals: What do you hope to learn from the data you collect? Once you know your goals, you can choose the right data collection method and collect the right data.
- Identify your target audience: Who are you trying to collect data from? Once you know your target audience, you can choose a data collection method that is likely to reach them.
- Choose the right data collection method: There are many different data collection methods available. Choose a method that is appropriate for your target audience and that will allow you to collect the data you need.
- Pilot test your data collection instrument: Before you launch your data collection effort, pilot test your data collection instrument (e.g., survey questionnaire, interview guide, observation checklist) with a small group of people to ensure that it is clear, easy to understand, and that it will collect the data you need.
- Collect data ethically: Be sure to collect data in an ethical and responsible manner. This means respecting people’s privacy and ensuring that they understand how their data will be used.
Example of data collection in data analytics:
A company that sells e-commerce software wants to learn more about how their customers use their product. They develop an online survey and ask their customers to complete it. The survey asks customers about their demographics, how they use the software, and what they like and dislike about it.
The company then analyzes the survey results. They find that their customers are mostly small businesses and that they use the software to manage their inventory and orders. The customers also appreciate the software’s ease of use and its customer support.
The company uses the results of the survey to improve their product and to develop new features that their customers will find valuable.
Data collection is an essential part of the data analytics process. By collecting the right data and using it effectively, organizations can gain valuable insights and make better decisions.
To track people's online activities and interests, which method of data collection is most effective?
Cookies
To track people’s online activities and interests, cookies are most effective.
Right now data is being generated all around the world, and we're talking tons of data. Every minute of every day, millions of texts and hundreds of millions of emails are sent. On top of that, millions of online searches are made and videos viewed, and those numbers are only growing. That's a lot of data. Let's learn more about how it's made and used. In this video, we'll talk about the ways that data can be generated and how industries collect data themselves.

Every piece of information is data. All that data is usually generated as a result of our activity in the world. These days, we spend a lot of time online. With social media and mobile devices, millions and millions of people are adding to the huge amount of data out there, each and every day. Think about it like this: every digital photo online is one piece of data. Every photo itself holds even more data, from the number of pixels to the colors contained in each of those pixels.

But that's not the only way data is made. We can also generate data by collecting information. This data generation and collection comes with a few more things to think about. It needs to be done with consideration to ethics so that we maintain people's rights and privacy. We'll learn more about that later on.

For now, let's check out a real-world example. The United States Census Bureau uses forms to collect data about the country's population. This data is used for a number of reasons, like funding for schools, hospitals, and fire departments. The Bureau also collects information about things like U.S. businesses, creating their own data in the process. The great thing about this is that others can then use the data for their own needs, including analysis. The annual business survey is used to figure out the needs of businesses and how to provide them with resources to help them succeed.

I actually generate data in the analytics I do for the health care industry. We run a lot of surveys to learn how patients feel about certain things related to their health care. For example, one survey asked how patients feel about telemedicine versus in-person doctor visits. The data we collected helped the companies we work with improve the care that their patients receive.

Survey data is just one example. There's all kinds of data being generated all the time, and there's lots of different ways to collect it. Even something as simple as an interview can help someone collect data. Imagine you're in a job interview. To impress the hiring manager, you want to share information about yourself. The hiring manager collects that data and analyzes it to help them decide whether to hire you or not. But it goes both ways. You could also collect your own data about the company to help you decide if the company is a good fit for you, or you can use the data you collect to come up with thoughtful questions to ask the interviewer.

Scientists also generate data. They use a lot of observations in their work. For example, they might collect data by studying animal behavior or looking at bacteria under a microscope. Earlier we talked about the forms that the U.S. Census Bureau uses to collect data. Forms, questionnaires, and surveys are commonly used ways to collect and generate data.

One thing to note: data that's generated online doesn't always happen directly. Have you ever wondered why some online ads seem to make really accurate suggestions, or how some websites remember your preferences? This is done using cookies, which are small files stored on computers that contain information about users. Cookies can help inform advertisers about your personal interests and habits based on your online surfing, without personally identifying you.

As a real-world analyst, you'll have all kinds of data right at your fingertips, and lots of it too. Knowing how it's been generated can help add context to the data, and knowing how to collect it can make the data analysis process more efficient. Coming up, you'll learn how to decide what data to collect for your analysis. So stay tuned.
Video: Determining what data to collect
When collecting data for a data analytics project, there are a number of factors to consider, including:
- What data is needed? The first step is to identify the specific data that is needed to address the problem or question being investigated.
- How will the data be collected? There are a variety of data collection methods available, such as surveys, interviews, observations, and sensor data. The best method will depend on the specific needs of the project.
- Where will the data come from? Data can be collected from a variety of sources, such as first-party data (collected by the individual or organization conducting the analysis), second-party data (collected by another organization and sold to the individual or organization conducting the analysis), and third-party data (collected by an organization that is not directly involved in the analysis). The source of the data will affect its reliability and accuracy.
- How much data is needed? It is important to collect enough data to be statistically significant, but too much data can be difficult and expensive to manage and analyze.
- What data type is needed? There are a variety of data types, such as numerical, categorical, and text data. The type of data needed will depend on the specific analysis being performed.
- What time frame is needed for data collection? The time frame for data collection will depend on the specific needs of the project. If an immediate answer is needed, historical data can be used. However, if tracking patterns over time is needed, a longer time frame may be necessary.
It is important to note that data collection should be done in an ethical and responsible manner. This means respecting people’s privacy and ensuring that data is used for its intended purpose.
Tutorial: Determining what data to collect in data analytics
Data collection is the first step in the data analytics process. It is important to collect the right data in order to get meaningful results. However, with a nearly endless amount of data available, it can be difficult to know what data to collect.
Here are some tips for determining what data to collect in data analytics:
- Start by identifying the business question or problem that you are trying to solve. What do you hope to learn from the data? Once you know your goal, you can start to think about what data you need to collect in order to achieve it.
- Consider your audience. Who are you trying to learn about? Once you know your audience, you can start to think about what data is relevant to them.
- Think about the different types of data that are available. There are many different types of data, such as demographic data, behavioral data, and attitudinal data. The type of data you need to collect will depend on your business question or problem.
- Consider the quality of the data. Not all data is created equal. Some data is more reliable and accurate than others. When choosing data, it is important to consider its quality.
- Think about the cost and time required to collect the data. Collecting data can be expensive and time-consuming. It is important to weigh the costs and benefits of collecting different types of data before making a decision.
Here are some examples of how to apply these tips:
- Example 1: A company that sells e-commerce software wants to learn more about how their customers use their product. They identify the following business question: “What features of our product are our customers using most often?” To answer this question, they need to collect data on how customers use the product. This data could include information such as which features customers click on most often, how long they spend using each feature, and whether they complete tasks using the features.
- Example 2: A political campaign wants to learn more about the opinions of voters in their district. They identify the following business question: “What are the most important issues to voters in our district?” To answer this question, they need to collect data on voters’ opinions on a variety of issues. This data could be collected through surveys, interviews, or focus groups.
Once you have collected the data, you can start to analyze it to identify patterns and trends. This information can then be used to make informed decisions about your business or organization.
Here are some additional tips for determining what data to collect:
- Start small. It is better to start with a small amount of data that is relevant to your business question or problem than to collect a large amount of data that is not relevant.
- Be specific. When choosing data, be as specific as possible. For example, instead of collecting data on “customer satisfaction,” collect data on “customer satisfaction with the checkout process.”
- Be flexible. As you learn more about your business question or problem, you may need to adjust the data that you are collecting. Be prepared to adapt your data collection strategy as needed.
By following these tips, you can determine what data to collect in data analytics and get the most out of your data analysis.
In instances when collecting data from an entire population is challenging, data analysts may choose to use what?
A sample
In instances when collecting data from an entire population is challenging, data analysts may choose to use a sample. A sample is a part of a population that is representative of that population.
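Drawing a random sample from a population can be sketched with Python's standard library. The population here is a hypothetical list of 10,000 data values:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: one data value per tree in a city inventory.
population = list(range(1, 10_001))  # 10,000 data values

# Measuring every tree is costly; a random sample is far cheaper.
sample = random.sample(population, k=100)

print(len(sample))        # 100
print(len(set(sample)))   # 100 -- sampled without replacement, so no repeats
```

`random.sample` draws without replacement, which matches the idea of picking distinct members of the population; how large the sample needs to be depends on the project.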
The data-collection process involves deciding what data to use, determining how much data to collect, and selecting the right data type. Which of the following are also steps in the data-collection process? Select all that apply.
- Determining the time frame
- Choosing data sources
Determining the time frame and choosing data sources are steps in the data collection process.
Welcome back. We've talked a lot about all the data out there in the world. But as a data analyst, you'll need to decide what kind of data to collect and use for every project. With a nearly endless amount of data out there, this can be quite a data dilemma, but there's good news. In this video, you'll learn which factors to consider when collecting data.

Usually, you'll have a head start in figuring out the right data for the job, because the data you need will be given to you, or your business task or problem will narrow down your choices. Let's start with a question like: what's causing increased rush hour traffic in your city? First, you need to know how the data will be collected. You might use observations of traffic patterns to count the number of cars on city streets during particular times. You notice that cars are getting backed up on a specific street.

That brings us to data sources. In our traffic example, your observations would be first-party data. This is data collected by an individual or group using their own resources. Collecting first-party data is typically the preferred method because you know exactly where it came from. You might also have second-party data, which is data collected by a group directly from its audience and then sold. In our example, if you aren't able to collect your own data, you might buy it from an organization that's led traffic pattern studies in your city. This data didn't start with you, but it's still reliable because it came from a source that has experience with traffic analysis. The same can't always be said about third-party data, or data collected from outside sources who did not collect it directly. This data might have come from a number of different sources before you investigated it. It might not be as reliable, but that doesn't mean it can't be useful. You'll just want to make sure you check it for accuracy, bias, and credibility. Actually, no matter what kind of data you use, it needs to be inspected for accuracy and trustworthiness. We'll learn more about that process later. For now, just remember that the data you choose should apply to your needs, and it must be approved for use.

As a data analyst, it's your job to decide what data to use, and that means choosing the data that can help you find answers and solve problems, and not getting distracted by other data. In our traffic example, financial data probably wouldn't be that helpful, but existing data about high-volume traffic times would be.

Okay, now let's talk about how much data to collect. In data analytics, a population refers to all possible data values in a certain data set. If you're analyzing data about car traffic in a city, your population would be all the cars in that area. But collecting data from the entire population can be pretty challenging. That's why a sample can be useful. A sample is a part of a population that is representative of the population. You might collect a data sample about one spot in the city and analyze the traffic there, or you might pull a random sample from all existing data in the population. How you choose your sample will depend on your project.

As you collect data, you'll also want to make sure you select the right data type. For traffic data, an appropriate data type could be the dates of traffic records stored in a date format. The dates could help you figure out what days of the week there is likely to be a high volume of traffic in the future. We'll explore this topic in more detail soon.

Finally, you need to determine the time frame for data collection. In our example, if you needed an answer immediately, you'd have to use historical data, which is data that already exists. But let's say you needed to track traffic patterns over a long period of time. That might affect the other decisions you make during data collection.

Now you know more about the different data collection considerations you'll use as a data analyst. Because of that, you'll be able to find the right data when you start collecting it yourself. There's still more to learn about data collection, so stay tuned.
Reading: Selecting the right data
Overview
Following are some data-collection considerations to keep in mind for your analysis:
How the data will be collected
Decide if you will collect the data using your own resources or receive (and possibly purchase it) from another party. Data that you collect yourself is called first-party data.
Data sources
If you don’t collect the data using your own resources, you might get data from second-party or third-party data providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect the data themselves. Third-party data might come from a number of different sources.
Solving your business problem
Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your problem question. For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates.
How much data to collect
If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs.
Time frame
If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists.
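Restricting historical data to a chosen time frame can be sketched in a few lines. The traffic records below are invented for illustration:

```python
from datetime import date

# Made-up historical records: (date, cars counted at rush hour).
records = [
    (date(2023, 5, 1), 1200),
    (date(2023, 5, 2), 1350),
    (date(2023, 6, 1), 900),
    (date(2023, 6, 15), 1100),
]

# Restrict the analysis to the chosen time frame (May 2023 here).
start, end = date(2023, 5, 1), date(2023, 5, 31)
in_frame = [(d, n) for d, n in records if start <= d <= end]

print(len(in_frame))  # 2 records fall inside the time frame
```

Because the records already exist, this kind of filtering gives an immediate answer; tracking a trend over a longer period would instead require continuing to collect new records.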
The original reading includes a flowchart for deciding whether to use existing historical data or to collect new data, based on how much time you have.
Practice Quiz: Test your knowledge on collecting data
Which method of data collection is most commonly used by scientists?
Observations
Observation is the data collection method most often used by scientists.
Organizations such as the U.S. Centers for Disease Control (CDC) often use data collected from hospitals. What kind of data is the CDC using if it is collected by hospitals, then sold to the CDC for its own analysis?
Second-party data
Data gathered by hospitals, then collected by the CDC, is an example of second-party data.
Fill in the blank: In data analytics, a _____ refers to all possible data values in a certain dataset.
population
In data analytics, a population refers to all possible data values in a certain dataset.
Differentiate between data formats and structures
Video: Discover data formats
This video covers data formats and how they are used in data analysis. There are two main types of data: quantitative and qualitative. Quantitative data can be measured or counted, while qualitative data cannot easily be expressed as a number. Quantitative data can be further divided into discrete and continuous data. Discrete data is counted and has a limited number of values, while continuous data is measured and can be expressed with many decimal places.
Qualitative data can be further divided into nominal and ordinal data. Nominal data is categorized without a set order, while ordinal data has a set order or scale.
There are two main types of data sources: internal and external. Internal data is data that lives within a company’s own systems, while external data is data that lives and is generated outside of an organization.
Data can also be structured or unstructured. Structured data is data that is organized in a certain format, such as rows and columns, while unstructured data is not organized in any easily identifiable manner.
Data formats are important because they make data more easily searchable and analysis-ready. Data analysts typically work with structured data, which is usually in the form of a table, spreadsheet, or relational database. However, they may also come across unstructured data, such as audio and video files.
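The four subtypes described above can be sketched with a few hypothetical values (all of the numbers and labels below are made up):

```python
# Quantitative, discrete: counted, limited set of values.
# Dollar amounts stop at cents; nothing exists between one and two cents.
box_office = 1_250_000.75

# Quantitative, continuous: measured, can carry many decimal places.
run_time_minutes = 110.0356

# Qualitative, nominal: categories with no set order.
watched_response = "Not sure"   # "Yes" / "No" / "Not sure" -- no ranking

# Qualitative, ordinal: categories with a set order or scale.
star_rating = 4                 # on a 1-5 scale, so 4 > 2 is meaningful
print(star_rating > 2)          # True: order matters for ordinal data
```

For nominal data only equality checks make sense ("Yes" is not "Not sure"), whereas ordinal values can be compared and sorted because the scale has a defined order.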
Tutorial on “Discover data formats” in Data analysis
What is a data format?
A data format is the way that data is organized and stored. There are many different data formats, each with its own advantages and disadvantages. Some common data formats include:
- CSV (comma-separated values): CSV is a simple and easy-to-read format that is often used to store data in spreadsheets.
- JSON (JavaScript Object Notation): JSON is a lightweight format that is often used to store data in web applications.
- XML (Extensible Markup Language): XML is a structured format that is often used to store data in complex documents.
- Parquet: Parquet is a columnar storage format that is often used to store large datasets in data warehouses.
- ORC (Optimized Row Columnar): ORC is a columnar storage format that is similar to Parquet.
Why is it important to know the format of a data file?
Knowing the format of a data file is important because it allows you to use the appropriate tools to read and analyze the data. For example, if you have a CSV file, you can use a spreadsheet program like Excel to read and analyze the data as a table. If you open the same file in a word processor like Microsoft Word, however, you will only see raw comma-separated text rather than an organized table.
How to discover the format of a data file
There are a few different ways to discover the format of a data file:
- Look at the file extension: The file extension is the part of the file name after the final dot. For example, a CSV file has the extension .csv.
- Open the file in a text editor: If you open the file in a text editor, you can often recognize the format of the data. For example, if the data is in CSV format, you will see values separated by commas.
- Use a data analysis tool: Many data analysis tools can automatically detect the format of a data file. For example, if you open a CSV file in Excel, Excel will automatically detect that the file is in CSV format.
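The first two checks above can be sketched as a rough format guesser in Python. This is an illustrative heuristic, not a full detector; the function name and the set of extensions it checks are choices made for this sketch:

```python
import csv
from pathlib import Path

def guess_format(filename: str, first_chunk: str) -> str:
    """Guess a file's format from its extension, falling back to a
    quick peek at the content (a rough sketch, not a full detector)."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".csv", ".json", ".xml", ".parquet", ".orc"}:
        return suffix.lstrip(".")  # trust a recognized extension

    # No recognized extension: inspect the start of the content.
    stripped = first_chunk.lstrip()
    if stripped.startswith("<"):
        return "xml"
    if stripped.startswith("{") or stripped.startswith("["):
        return "json"
    try:
        csv.Sniffer().sniff(first_chunk)  # detects delimited text
        return "csv"
    except csv.Error:
        return "unknown"

print(guess_format("trees.csv", ""))                        # csv
print(guess_format("data.txt", "id,species\n1,bur oak\n"))  # csv
print(guess_format("data.txt", '{"id": 1}'))                # json
```

In practice, asking the data provider or reading the documentation is still more reliable than sniffing the bytes, but a check like this is useful when neither is available.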
Examples of data formats in data analytics
Here are a few examples of how data formats are used in data analytics:
- A data analyst might use a CSV file to store data from a customer survey.
- A data analyst might use a JSON file to store data from a web application.
- A data analyst might use an XML file to store data from a financial report.
- A data analyst might use a Parquet file to store data from a data warehouse.
- A data analyst might use an ORC file to store data from a data lake.
How to discover data formats in a data science project
When working on a data science project, there are a few things you can do to discover the format of the data you are working with:
- Ask the person who provided the data. If you have access to the person who provided the data, you can simply ask them what format the data is in. This is often the easiest and fastest way to find out the format of the data.
- Look at the documentation. If the data is from a public source, there may be documentation available that describes the format of the data. This documentation may be available on the website where you downloaded the data or in a separate file.
- Inspect the data. If you don’t have access to the person who provided the data and there is no documentation available, you can inspect the data yourself to try to determine the format. You can do this by opening the data in a text editor or by using a data analysis tool.
Once you have discovered the format of the data, you can use the appropriate tools to read and analyze the data.
Conclusion
Data formats are an important part of data analytics. Knowing the format of a data file allows you to use the appropriate tools to read and analyze the data. There are many different data formats, each with its own advantages and disadvantages. Some common data formats include CSV, JSON, XML, Parquet, and ORC.
An entertainment website displays a star rating for a movie based on user reviews. Users can select from one to five whole stars to rate the movie. The star rating is an example of what type of data? Select all that apply.
Ordinal, Discrete
The star rating is an example of ordinal data because the stars are in order of how much each person liked the movie. It's also an example of discrete data because a person has to choose a whole-star measure; half-stars weren't an option.
The use of external data is particularly valuable in which circumstances?
When analysis depends on as many data sources as possible
External data is particularly valuable when an analysis depends on as many sources as possible.
I don't know about you, but when I'm choosing a movie to watch, I sometimes get stuck between a couple of choices. If I'm in the mood for excitement or suspense, I might go for a thriller, but if I need a good laugh, I'll choose a comedy. If I really can't decide between two movies, I might even use some of my data analysis skills to compare and contrast them. Come to think of it, there really needs to be more movies about data analysts. I'd watch that. But since we can't watch a movie about data, at least not yet, we'll do the next best thing: watch data about movies!

We're going to take a look at this spreadsheet with movie data. We know we can compare different movies and movie genres. Turns out, you can do the same with data and data formats. Let's use our movie data spreadsheet to understand how that works.

We'll start with quantitative and qualitative data. If we check out column A, we'll find titles of the movies. This is qualitative data because it can't be counted, measured, or easily expressed using numbers. Qualitative data is usually listed as a name, category, or description. In our spreadsheet, the movie titles and cast members are qualitative data. Next up is quantitative data, which can be measured or counted and then expressed as a number. This is data with a certain quantity, amount, or range. In our spreadsheet here, the last two columns show each movie's budget and box office revenue. The data in these columns is listed in dollars, which can be counted, so we know that data is quantitative.

We can go even deeper into quantitative data and break it down into discrete or continuous data. Let's check out discrete data first. This is data that's counted and has a limited number of values. Going back to our spreadsheet, we'll find each movie's budget and box office returns in columns M and N. These are both examples of discrete data that can be counted and have a limited number of values. For example, the amount of money a movie makes can only be represented with exactly two digits after the decimal to represent cents. There can't be anything between one and two cents. Continuous data, on the other hand, can be measured, for example with a timer, and its value can be shown as a decimal with several places. Let's imagine a movie about data analysts that I'm definitely going to star in someday. You could express that movie's run time as 110.0356 minutes. You could even add fractional data after the decimal point if you needed to.

There's also nominal and ordinal data. Nominal data is a type of qualitative data that's categorized without a set order. In other words, this data doesn't have a sequence. Here's a quick example. Let's say you're collecting data about movies. You ask people if they've watched a given movie. Their responses would be in the form of nominal data: they could respond "Yes," "No," or "Not sure," and these choices don't have a particular order. Ordinal data, on the other hand, is a type of qualitative data with a set order or scale. If you asked a group of people to rank a movie from 1 to 5, some might rank it as a 2, others a 4, and so on. These rankings are in order of how much each person liked the movie.

Now let's talk about internal data, which is data that lives within a company's own systems. For example, if a movie studio had compiled all of the data in the spreadsheet using only their own collection methods, then it would be their internal data. The great thing about internal data is that it's usually more reliable and easier to collect. But in this spreadsheet, it's more likely that the movie studio had to use data owned or shared by other studios and sources, because it includes movies they didn't make. That means they'd be collecting external data. External data is, you guessed it, data that lives and is generated outside of an organization. External data becomes particularly valuable when your analysis depends on as many sources as possible.

A great thing about this data is that it's structured. Structured data is data that's organized in a certain format, such as rows and columns. Spreadsheets and relational databases are two examples of software that can store data in a structured way. You might remember our earlier exploration of structured thinking, which helps you add a framework to a problem so that you can solve it in an organized and logical manner. You can think of structured data in the same way. Having a framework for the data makes the data easily searchable and more analysis-ready. As a data analyst, you'll work with a lot of structured data, which will usually be in the form of a table, spreadsheet, or relational database. But sometimes you'll come across unstructured data. This is data that is not organized in any easily identifiable manner. Audio and video files are examples of unstructured data because there's no clear way to identify or organize their content. Unstructured data might have internal structure, but the data doesn't fit neatly in rows and columns like structured data.

And there you have it! Hopefully you're now more familiar with data formats and how you might use them in your work. In just a bit, you'll continue to explore structured data and learn even more about the data you'll use most often as an analyst. Coming soon to a screen near you.
Reading: Data formats in practice
Overview
When you think about the word “format,” a lot of things might come to mind. Think of an advertisement for your favorite store. You might find it in the form of a print ad, a billboard, or even a commercial. The information is presented in the format that works best for you to take it in. The format of a dataset is a lot like that, and choosing the right format will help you manage and use your data in the best way possible.
Data format examples
Primary vs. Secondary
Data Format Classification | Definition | Examples |
---|---|---|
Primary data | Collected by a researcher from first-hand sources | – Data from an interview you conducted – Data from a survey returned from 20 participants – Data from questionnaires you got back from a group of workers |
Secondary data | Gathered by other people or from other research | – Data you bought from a local data analytics firm’s customer profiles – Demographic data collected by a university – Census data gathered by the federal government |
Internal vs. External
Data Format Classification | Definition | Examples |
---|---|---|
Internal data | Data that lives inside a company’s own systems | – Wages of employees across different business units tracked by HR – Sales data by store location – Product inventory levels across distribution centers |
External data | Data that lives outside of a company or organization | – National average wages for the various positions throughout your organization – Credit reports for customers of an auto dealership |
Continuous vs. Discrete
Data Format Classification | Definition | Examples |
---|---|---|
Continuous data | Data that is measured and can have almost any numeric value | – Height of kids in third grade classes (52.5 inches, 65.7 inches) – Runtime markers in a video – Temperature |
Discrete data | Data that is counted and has a limited number of values | – Number of people who visit a hospital on a daily basis (10, 20, 200) – Room’s maximum capacity allowed – Tickets sold in the current month |
Qualitative vs. Quantitative
Data Format Classification | Definition | Examples |
---|---|---|
Qualitative | Subjective and explanatory measures of qualities and characteristics | – Exercise activity most enjoyed – Favorite brands of most loyal customers – Fashion preferences of young adults |
Quantitative | Specific and objective measures of numerical facts | – Percentage of board certified doctors who are women – Population of elephants in Africa – Distance from Earth to Mars |
Nominal vs. Ordinal
Data Format Classification | Definition | Examples |
---|---|---|
Nominal | A type of qualitative data that isn’t categorized with a set order | – First time customer, returning customer, regular customer – New job applicant, existing applicant, internal applicant – New listing, reduced price listing, foreclosure |
Ordinal | A type of qualitative data with a set order or scale | – Movie ratings (number of stars: 1 star, 2 stars, 3 stars) – Ranked-choice voting selections (1st, 2nd, 3rd) – Income level (low income, middle income, high income) |
Structured vs. Unstructured
Data Format Classification | Definition | Examples |
---|---|---|
Structured data | Data organized in a certain format, like rows and columns | – Expense reports – Tax returns – Store inventory |
Unstructured data | Data that isn’t organized in any easily identifiable manner | – Social media posts – Emails – Videos |
Practice Quiz: Self-Reflection: Unstructured data
Overview
Now that you have learned about unstructured data, you can pause for a moment and apply what you are learning. In this self-reflection, you will complete tasks with a neural network, consider your thoughts about data structuring, and respond to brief questions.
This self-reflection will help you develop insights into your own learning and prepare you to apply your knowledge of data structures to your interactions with unstructured data. As you complete tasks with a neural network website, you will explore concepts, practices, and principles to help refine your understanding and reinforce your learning. You’ve done the hard work, so make sure to get the most out of it: This reflection will help your knowledge stick!
Data structuring with Quick, Draw!
In this self-reflection, you will explore the nature of unstructured data through a crowd-sourced dataset.
Quick, Draw! is a neural network dataset that has millions of pictures drawn by people separated into categories like plants, animals, or vehicles. On the Quick, Draw! website, you can view a large dataset of hundreds of thousands of real doodles made by people on the internet. You can also draw your own. Through this process, you can train a neural network to recognize objects and learn more about the importance of structured data.
1. Visit the Quick, Draw! website.
2. In the upper left-hand corner, you will notice a drop-down menu like this:
Select a type of doodle to begin.
3. Click on different pictures to see details about the images on your screen. For example, there are more than one hundred thousand different drawings of elephants. Scroll through the list and see if there are any that don’t belong. If you find one that doesn’t match the intended object, click on it and select Flag as inappropriate.
4. Explore other categories of drawings. Select three categories that interest you and check out their doodles.
5. Optional: Explore further. Click Get the data to visit the GitHub page containing the entire dataset. As you become more familiar with data projects and start creating your own, you can return to this dataset and analyze it yourself. Click Play the game to draw your own doodles and contribute to Quick, Draw!’s dataset.
6. When you’re done, answer the reflection questions below.
Video: Understanding structured data
Structured data is data that is organized in a format like rows and columns. It works nicely within a data model, which is a model that is used for organizing data elements and how they relate to one another. Data elements are pieces of information, such as people’s names, account numbers, and addresses. Data models help to keep data consistent and provide a map of how data is organized. This makes it easier for analysts and other stakeholders to make sense of their data and use it for business purposes. Structured data is also useful for databases, making it easy for analysts to enter, query, and analyze the data whenever they need to. Structured data can also be applied directly to charts, graphs, heat maps, dashboards, and most other visual representations of data. Spreadsheets and databases that store data sets are widely used sources of structured data.
Understanding structured data in data analysis
Structured data is data that is organized in a specific format, such as rows and columns. This makes it easy for computers to read and process structured data. Structured data is often stored in databases and spreadsheets.
Examples of structured data
Here are some examples of structured data:
- Customer data in a CRM system
- Product data in an e-commerce database
- Financial data in an accounting system
- Sensor data from IoT devices
- Social media engagement metrics, such as like and follower counts
Why is structured data important for data analysis?
Structured data is important for data analysis because it is easy to search, sort, and filter. This makes it possible to quickly identify patterns and trends in the data. Structured data can also be used to create accurate and reliable reports.
How to use structured data for data analysis
To use structured data for data analysis, you will need to:
- Identify the data sources: The first step is to identify the data sources that contain the data you want to analyze. This could be a database, spreadsheet, or other type of file.
- Clean and prepare the data: Once you have identified the data sources, you will need to clean and prepare the data. This may involve removing duplicate rows, correcting errors, and converting the data to a consistent format.
- Analyze the data: Once the data is clean and prepared, you can start to analyze it. This may involve using statistical software to calculate summary statistics, create visualizations, and build models.
- Interpret the results: Once you have analyzed the data, you need to interpret the results. This involves identifying any patterns or trends in the data and drawing conclusions.
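The four steps above can be sketched with Python's standard library. This is a minimal illustration, not a prescribed workflow: the rows, column names, and numbers are made-up assumptions, and in practice you would likely use a dedicated tool such as a spreadsheet or a data analysis library.

```python
# A minimal sketch of the four steps above on a small, made-up dataset.
from statistics import mean

# Step 1: Identify the data source -- here, rows as they might arrive
# from a spreadsheet export, including a duplicate and a messy value.
raw_rows = [
    {"store": "North", "sales": "120"},
    {"store": "South", "sales": " 95 "},   # stray whitespace
    {"store": "North", "sales": "120"},    # duplicate row
    {"store": "East",  "sales": "143"},
]

# Step 2: Clean and prepare -- remove duplicate rows and convert the
# sales column to a consistent numeric type.
seen, clean_rows = set(), []
for row in raw_rows:
    key = (row["store"], row["sales"].strip())
    if key not in seen:
        seen.add(key)
        clean_rows.append({"store": row["store"], "sales": int(row["sales"])})

# Step 3: Analyze -- calculate a simple summary statistic.
average_sales = mean(r["sales"] for r in clean_rows)

# Step 4: Interpret -- report the result in context.
print(f"Average sales across {len(clean_rows)} stores: {average_sales:.1f}")
```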
Tips for using structured data for data analysis
Here are some tips for using structured data for data analysis:
- Use a data analysis tool: A data analysis tool will make it easier to clean, prepare, and analyze structured data. There are many different data analysis tools available, both free and commercial.
- Start with a clear question: Before you start analyzing the data, it is important to have a clear question in mind. This will help you to focus your analysis and get the most out of the data.
- Use visualization: Visualization can be a powerful tool for understanding structured data. Charts and graphs can help you to identify patterns and trends that might be difficult to see in the raw data.
- Be critical of the results: It is important to be critical of the results of your analysis. Consider the quality of the data, the limitations of the analysis methods, and the potential for bias.
Conclusion
Structured data is an important part of data analysis. It is easy to read and process, and it can be used to create accurate and reliable reports. By following the tips above, you can use structured data to gain valuable insights into your business or organization.
Hi, great to see you again! Earlier, we compared some data formats,
including structured and unstructured data. Most of the data being generated
right now is actually unstructured. Audio files, video files,
emails, photos, and social media are all examples
of unstructured data. These can be harder to analyze
in their unstructured format. But here’s the good news, you’ll be working with structured
data most of the time. For example, if you need to analyze data
about the unstructured data in emails, photos, and social media sites, it’ll most likely be structured for
analysis before you even get to it. Because of that, I want to explore
structured data a bit more. As a quick refresher, structured data is
data organized in a format like rows and columns. But there’s definitely
more to it than that. Structured data works nicely within a data
model, which is a model that is used for organizing data elements and
how they relate to one another. What are data elements? They’re pieces of information,
such as people’s names, account numbers, and addresses. Data models help to keep
data consistent and provide a map of how data is organized. This makes it easier for analysts and other stakeholders to make sense of their
data and use it for business purposes. In addition to working well within data
models, structured data is also useful for databases. This makes it easy for
analysts to enter, query, and analyze the data whenever they need to. This also helps make data
visualization pretty easy because structured data can be applied
directly to charts, graphs, heat maps, dashboards and most other
visual representations of data. Alright, so now we know
that spreadsheets and databases that store data sets are widely
used sources of structured data. After you explore some
other data structures, you’ll check out more data
types using a spreadsheet. The adventure continues!
Reading: The structure of data
Overview
Data is everywhere and it can be stored in lots of ways. Two general categories of data are:
- Structured data: Organized in a certain format, such as rows and columns.
- Unstructured data: Not organized in any easy-to-identify way.
For example, when you rate your favorite restaurant online, you’re creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you’re using unstructured data.
Here’s a refresher on the characteristics of structured and unstructured data:
Structured data
As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.
Unstructured data
Unstructured data isn’t organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.
The fairness issue
The lack of structure makes unstructured data difficult to search, manage, and analyze. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others. And as you’re learning, an unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.
Reading: Data modeling levels and techniques
Overview
This reading introduces you to data modeling and different types of data models. Data models help keep data consistent and enable people to map out how data is organized. A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways.
Important note: As a junior data analyst, you won’t be asked to design a data model. But you might come across existing data models your organization already has in place.
What is data modeling?
Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole.
Levels of data modeling
Each level of data modeling has a different level of detail.
- Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn’t contain technical details.
- Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn’t spell out actual names of database tables. That’s the job of a physical data model.
- Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
More information can be found in this comparison of data models.
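As an illustration of what the physical level pins down, the sketch below creates one table with Python's built-in sqlite3 module. The table name, column names, and data types are hypothetical examples, not part of the course material.

```python
# A physical data model spells out concrete table names, column names,
# and data types. This sketch defines one such table in an in-memory
# SQLite database; all names here are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- uniquely identifies a record
        full_name   TEXT    NOT NULL,
        signup_date TEXT,                  -- SQLite stores dates as text
        balance     REAL
    )
""")

# Inspect the physical details the model defined.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]
print(columns)
conn.close()
```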
Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but two common methods are the Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system’s entities, attributes, operations, and their relationships. As a junior data analyst, you will need to understand that there are different data modeling techniques, but in practice, you will probably be using your organization’s existing technique.
You can read more about ERD, UML, and data dictionaries in this data modeling techniques article.
Data analysis and data modeling
Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team!
Practice Quiz: Test your knowledge on data formats and structures
Fill in the blank: The running time of a movie is an example of _____ data.
continuous
Running times of movies are an example of continuous data, which is measured and can have almost any numeric value.
What are the characteristics of unstructured data? Select all that apply.
Is not organized
May have an internal structure
Unstructured data is not organized, although it may have an internal structure.
Structured data enables data to be grouped together to form relations. This makes it easier for analysts to do what with the data? Select all that apply.
Search
Analyze
Store
Structured data that is grouped together to form relations enables analysts to more easily store, search, and analyze the data.
Which of the following is an example of unstructured data?
Email message
An example of unstructured data is an email message. Other examples of unstructured data are video files and social media content.
Explore data types, fields, and values
Video: Know the type of data you’re working with
Data types in spreadsheets can be one of three things: a number, a text or string, or a Boolean.
Number data types are used to represent numerical values, such as search interests, currency, and dates.
Text or string data types are used to represent sequences of characters and punctuation, such as names, addresses, and phone numbers.
Boolean data types are used to represent two possible values: true or false.
It is important to know the data type of each cell in a spreadsheet, as this will affect how the data can be used and manipulated. For example, if you try to perform a mathematical operation on a text cell, you will get an error.
One common issue people encounter in spreadsheets is mistaking data types for cell values. For example, if you try to average a column of text cells, you will get an error, because text cells are not numbers and therefore cannot be averaged.
To avoid errors, it is important to make sure that you are using the correct data type for each cell in your spreadsheet. You can do this by checking the cell format or by using the data type validation feature.
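The same distinction between data types can be seen outside of spreadsheets. The sketch below mirrors the averaging example in plain Python; the column values are made-up search-interest figures.

```python
# Why cell data types matter: averaging works on a number column but
# fails on a text column, mirroring the spreadsheet error described above.
from statistics import mean

number_column = [14, 22, 9]           # number data type
text_column = ["false", "true"]       # text/string data type
boolean_column = [False, True, True]  # Boolean data type

print(mean(number_column))  # averaging numbers works

try:
    mean(text_column)       # text can't be averaged
except TypeError as err:
    print("Error:", err)
```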
Know the type of data you’re working with in data analytics
The type of data you’re working with is one of the most important factors to consider when conducting data analytics. The type of data will dictate the types of analyses you can perform and the types of conclusions you can draw.
There are two main types of data: qualitative and quantitative.
- Qualitative data is data that is descriptive in nature, such as names, addresses, and opinions. Qualitative data cannot be measured or counted.
- Quantitative data is data that is numerical in nature, such as height, weight, and temperature. Quantitative data can be measured and counted.
Within these two main categories, there are many different types of data. Here are some examples:
- Demographic data: This type of data includes information about people, such as age, gender, education, and income. Demographic data is often used to understand the characteristics of a population or to segment a population into different groups.
- Behavioral data: This type of data tracks how people interact with products, services, and content. Behavioral data can be used to understand customer preferences, identify trends, and improve product development.
- Financial data: This type of data includes information about a company’s financial performance, such as revenue, expenses, and profits. Financial data is often used to assess the financial health of a company and to make investment decisions.
- Operational data: This type of data tracks the performance of business processes, such as order fulfillment and customer service. Operational data can be used to identify bottlenecks, improve efficiency, and reduce costs.
Once you know the type of data you’re working with, you can start to think about the types of analyses you can perform. For example, if you have quantitative data, you can perform statistical analysis to identify trends and patterns. If you have qualitative data, you can perform content analysis to identify common themes and ideas.
Here are some tips for knowing the type of data you’re working with in data analytics:
- Identify the source of the data. Where did the data come from? What is the purpose of the data?
- Look at the data. What kind of information is included in the data? Is the data in a numerical or textual format?
- Check the data type of each column in the data set. This can be done using a spreadsheet program or a programming language such as Python.
- Look for outliers and missing values. Outliers are data points that are significantly different from the rest of the data. Missing values are data points that are missing from the data set.
It is important to know the type of data you’re working with in data analytics in order to perform the correct analysis and get accurate results.
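One of the tips above, checking the data type of each column, can be done with only a few lines of Python. The rows below are made-up examples chosen to show a missing value and an inconsistent type; in practice you might use a library such as pandas instead of the standard library.

```python
# Check each column's data types and count missing values in a small,
# made-up dataset of rows (dictionaries keyed by column name).
rows = [
    {"name": "Ada",   "age": 36,   "city": "London"},
    {"name": "Grace", "age": 45,   "city": None},       # missing value
    {"name": "Alan",  "age": "41", "city": "Wilmslow"}, # inconsistent type
]

for column in rows[0]:
    types = {type(r[column]).__name__ for r in rows}
    missing = sum(r[column] is None for r in rows)
    print(f"{column}: types={sorted(types)}, missing={missing}")
```

Spotting a column whose values mix types (like `age` above) is often the first clue that the data needs cleaning before analysis.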
Here are some examples of how different types of data can be used in data analytics:
- A retail company could use demographic data to understand the characteristics of its customers. This information could then be used to develop targeted marketing campaigns.
- A technology company could use behavioral data to track how users interact with its products. This information could then be used to improve the user experience and develop new features.
- A financial services company could use financial data to assess the risk of a loan applicant. This information could then be used to make lending decisions.
- A manufacturing company could use operational data to track the performance of its production lines. This information could then be used to identify bottlenecks and improve efficiency.
By understanding the different types of data and how they can be used in data analytics, you can make better decisions about how to collect, clean, and analyze your data.
By now you’ve learned
a lot about data. From generated data, to
collected data, to data formats, it’s good to know
as much as you can about the data you’ll
use for analysis. In this video, we’ll
talk about another way you can describe
data: the data type. A data type is a
specific kind of data attribute that tells what kind of
value the data is. In other words, a data type tells you what kind of data
you’re working with. Data types can be different depending on the query
language you’re using. For example, SQL allows for different data types depending on which database you’re using. For now though, let’s focus on the data types that you’ll
use in spreadsheets. To help us out, we’ll use a spreadsheet that’s
already filled with data. We’ll call it “Worldwide Interests in Sweets
through Google Searches.” Now a data type in a spreadsheet can be
one of three things: a number, a text or string, or a Boolean. You might find spreadsheet
programs that classify them a bit differently or
include other types, but these value types
cover just about any data you’ll find
in spreadsheets. We’ll look at all of
these in just a bit. Looking at columns B, D, and F, we find number data types. Each number represents
the search interest for the terms “cupcakes,” “ice cream,” and “candy”
for a specific week. The closer a number is to 100, the more popular that search
term was during that week. One hundred represents
peak popularity. Keep in mind that in this case, 100 is a relative value, not the actual
number of searches. It represents the maximum number of searches during
a certain time. Think of it like a
percentage on a test. All other searches are then
also valued out of 100. You might notice this in
other data sets as well. Gold star for 100! If you needed to, you could
change the numbers into percents or other
formats, like currency. These are all examples
of number data types. In column H, the data shows the most popular
treat for each week, based on the search data. So as we’ll find in cell H4 for the week beginning
July 28th, 2019, the most popular
treat was ice cream. This is an example
of a text data type, or a string data type, which is a sequence
of characters and punctuation that contains
textual information. In this example, that information would be the treats
and people’s names. These can also
include numbers, like phone numbers or numbers
in street addresses. But these numbers wouldn’t
be used for calculations. In this case they’re treated
like text, not numbers. In columns C, E, and G, it seems like
we’ve got some text. But the text here isn’t a
text or string data type. Instead, it’s a
Boolean data type. A Boolean data type is a data type with only
two possible values: true or false. Columns C, E, and G show Boolean data for whether the search interest for each week is at least 50 out of 100. Here’s how it works.
To get this data, we’ve created a formula
that calculates whether the search interest
data in columns B, D, and F is 50 or greater. In cell B4, the search
interest is 14. In cell C4, we find
the word false because, for this week of data, the search interest
is less than 50. For each cell in columns C, E, and G, the only two possible
values are true or false. We could change the formula so other words appear in
these cells instead, but it’s still Boolean data. You’ll get a chance to read more about the Boolean data type soon. Let’s talk about a
common issue that people encounter in spreadsheets: mistaking data types for cell values. For example, in cell B57, we can create a formula to
calculate data in other cells. This will give us the average
of the search interests in cupcakes across all
weeks in the dataset, which is about 15. The formula works because we calculated using a
number data type. But if we tried it with a
text or string data type, like the data in column
C, we’d get an error. Error values usually
happen if a mistake is made in entering the
values in the cells. The more you know your data
types and which ones to use, the less errors you’ll run into. There you have it, a
data type for everyone. We’re not done yet. Coming up, we’ll go deeper into the
relationship between data types, fields, and values. See you soon.
Reading: Understanding Boolean logic
Reading
In this reading, you will explore the basics of Boolean logic and learn how to use multiple conditions in a Boolean statement. These conditions are created with Boolean operators, including AND, OR, and NOT. These operators are similar to mathematical operators and can be used to create logical statements that filter your results. Data analysts use Boolean statements to do a wide range of data analysis tasks, such as creating queries for searches and checking for conditions when writing programming code.
Boolean logic example
Imagine you are shopping for shoes, and are considering certain preferences:
- You will buy the shoes only if they are pink and grey
- You will buy the shoes if they are entirely pink or entirely grey, or if they are pink and grey
- You will buy the shoes if they are grey, but not if they have any pink
Below are Venn diagrams that illustrate these preferences. AND is the center of the Venn diagram, where two conditions overlap. OR includes either condition. NOT includes only the part of the Venn diagram that doesn’t contain the exception.
The AND operator
Your condition is “If the color of the shoe has any combination of grey and pink, you will buy them.” The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say “IF (Color="Grey") AND (Color="Pink") then buy them.” The AND operator lets you stack multiple conditions.
Below is a simple truth table that outlines the Boolean logic at work in this statement. In the Color is Grey column, there are two pairs of shoes that meet the color condition. And in the Color is Pink column, there are two pairs that meet that condition. But in the If Grey AND Pink column, there is only one pair of shoes that meets both conditions. So, according to the Boolean logic of the statement, there is only one pair marked true. In other words, there is one pair of shoes that you can buy.
Color is Grey | Color is Pink | If Grey AND Pink, then Buy | Boolean Logic |
---|---|---|---|
Grey/True | Pink/True | True/Buy | True AND True = True |
Grey/True | Black/False | False/Don’t buy | True AND False = False |
Red/False | Pink/True | False/Don’t buy | False AND True = False |
Red/False | Green/False | False/Don’t buy | False AND False = False |
The OR operator
The OR operator lets you move forward if either one of your two conditions is met. Your condition is “If the shoes are grey or pink, you will buy them.” The Boolean statement would be “IF (Color="Grey") OR (Color="Pink") then buy them.” Notice that any shoe that meets either the Color is Grey or the Color is Pink condition is marked as true by the Boolean logic. According to the truth table below, there are three pairs of shoes that you can buy.
Color is Grey | Color is Pink | If Grey OR Pink, then Buy | Boolean Logic |
---|---|---|---|
Red/False | Black/False | False/Don’t buy | False OR False = False |
Black/False | Pink/True | True/Buy | False OR True = True |
Grey/True | Green/False | True/Buy | True OR False = True |
Grey/True | Pink/True | True/Buy | True OR True = True |
The NOT operator
Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is “You will buy any grey shoe except for those with any traces of pink in them.” Your Boolean statement would be “IF (Color="Grey") AND NOT (Color="Pink") then buy them.” Now, all of the grey shoes that aren’t pink are marked true by the Boolean logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink condition. Only one pair of shoes is excluded in the truth table below.
Color is Grey | Color is Pink | Boolean Logic for NOT Pink | If Grey AND (NOT Pink), then Buy | Boolean Logic |
---|---|---|---|---|
Grey/True | Red/False | Not False = True | True/Buy | True AND True = True |
Grey/True | Black/False | Not False = True | True/Buy | True AND True = True |
Grey/True | Green/False | Not False = True | True/Buy | True AND True = True |
Grey/True | Pink/True | Not True = False | False/Don’t buy | True AND False = False |
The power of multiple conditions
For data analysts, the real power of Boolean logic comes from being able to combine multiple conditions in a single statement. For example, if you wanted to filter for shoes that were grey or pink, and waterproof, you could construct a Boolean statement such as: “IF ((Color="Grey") OR (Color="Pink")) AND (Waterproof="True").” Notice that you can use parentheses to group your conditions together.
Whether you are doing a search for new shoes or applying this logic to your database queries, Boolean logic lets you create multiple conditions to filter your results. And now that you know a little more about how Boolean logic is used, you can start using it!
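To see these operators in action, here is a small illustrative sketch in Python (not part of the course materials; the shoe inventory and field names are invented for the example):

```python
# Hypothetical shoe inventory; each record is a dict of attributes.
shoes = [
    {"color": "Grey", "waterproof": True},
    {"color": "Pink", "waterproof": False},
    {"color": "Black", "waterproof": True},
    {"color": "Pink", "waterproof": True},
]

# AND: both conditions must be true.
grey_and_waterproof = [s for s in shoes
                       if s["color"] == "Grey" and s["waterproof"]]

# OR: either condition may be true.
grey_or_pink = [s for s in shoes
                if s["color"] == "Grey" or s["color"] == "Pink"]

# Parentheses group conditions: (grey OR pink) AND waterproof.
to_buy = [s for s in shoes
          if (s["color"] == "Grey" or s["color"] == "Pink") and s["waterproof"]]
```

The same grouping with parentheses carries over to SQL WHERE clauses and spreadsheet filter formulas, so the habit of writing conditions this way pays off in both tools.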
Additional Reading/Resources
- Learn about who pioneered Boolean logic in this historical article: Origins of Boolean Algebra in the Logic of Classes.
- Find more information about using AND, OR, and NOT from these tips for searching with Boolean operators.
Video: Data table components
A data table, or tabular data, is arranged in rows and columns. The rows can be called “records” and the columns can be called “fields”. Each record has the same fields as the other records in the same order. Each field has a specific data type, such as text, number, or Boolean. Data tables are used in a variety of applications, such as music playlists, calendars, and email inboxes. As a data analyst, you will work with many different data tables and it is important to understand the structures of the tables you are working with.
Data table components in data analytics
Data tables are a fundamental component of data analytics. They are used to organize and display data in a way that is easy to understand and analyze. Data tables can be used to store and analyze a variety of data types, including quantitative data, qualitative data, and structured data.
Data tables are typically composed of the following components:
- Header row: The header row contains the names of the columns in the table.
- Data rows: The data rows contain the actual data.
- Footer row: The footer row can contain summary statistics for the data, such as the mean, median, and mode.
Data tables can also include other components, such as:
- Row labels: Row labels can be used to identify the individual rows in the table.
- Column labels: Column labels can be used to identify the individual columns in the table.
- Data types: Data types can be used to specify the type of data that is stored in each column.
- Filters: Filters can be used to display only a subset of the data in the table.
- Sorts: Sorts can be used to arrange the data in the table in a specific order.
Data tables can be used to perform a variety of data analysis tasks, such as:
- Descriptive statistics: Data tables can be used to calculate descriptive statistics for data, such as the mean, median, mode, and standard deviation.
- Correlation analysis: Data tables can be used to calculate the correlation between two or more variables.
- Regression analysis: Data tables can be used to perform regression analysis to identify the relationship between two or more variables.
- Data visualization: Data tables can be used to create data visualizations, such as charts and graphs.
Data tables are a powerful tool for data analytics. By understanding the different components of data tables and how to use them, you can perform a variety of data analysis tasks and gain insights from your data.
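As a minimal sketch of the descriptive-statistics task above (using Python's standard library; the product names and sales figures are invented for illustration):

```python
import statistics

# A small made-up data table: each record has the same fields in the same order.
table = [
    {"product": "A", "units_sold": 10},
    {"product": "B", "units_sold": 15},
    {"product": "C", "units_sold": 10},
    {"product": "D", "units_sold": 25},
]

# Extract one field (column) across all records, then summarize it.
units = [record["units_sold"] for record in table]
mean_units = statistics.mean(units)      # 15
median_units = statistics.median(units)  # 12.5
mode_units = statistics.mode(units)      # 10
```

Because every record keeps the same fields in the same order, pulling out a single column is a one-line operation, which is exactly why consistent data types within a column matter.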
Here are some tips for using data tables effectively in data analytics:
- Use descriptive column labels. The column labels should clearly describe the data that is contained in each column.
- Use consistent data types. All of the data in a column should be of the same data type. This will make it easier to perform data analysis tasks.
- Use filters and sorts. Filters and sorts can be used to display only a subset of the data in the table and to arrange the data in a specific order. This can make it easier to identify patterns and trends in the data.
- Use data visualization. Data visualizations can be used to communicate the findings of your data analysis in a clear and concise way.
By following these tips, you can use data tables effectively to perform data analytics and gain insights from your data.
Here’s a riddle for you. What do a music playlist, a calendar agenda, and an email inbox have in common? I’ll give you a hint: it’s not a weekly jam session. The answer is they’re all arranged in tables. Go ahead and check out your email inbox, a favorite playlist, or your calendar agenda. There are tables in every one!
A data table, or tabular data, has a very simple structure: it’s arranged in rows and columns. You can call the rows “records” and the columns “fields.” They basically mean the same thing, but records and fields can be used for any kind of data table, while rows and columns are usually reserved for spreadsheets. When talking about structured databases, people in data analytics usually go with “records” and “fields.” Sometimes a field can also refer to a single piece of data, like the value in a cell. In any case, you’ll hear both versions of these terms used throughout this program and your job.
Let’s go back to our playlist example, using the new terms we just introduced. Each song is a record. Each record has the same fields as the other records, in the same order. In other words, the playlist has the same information about each song. Each song characteristic, like the title and the artist, is a field. Each separate field has the same data type, but different fields can have different types. Let me show you what I mean. For the song list, the song titles are a text or string type, while the song’s length could be a number type if you’re using it for calculations, or it could be a date and time type. The column for favorites is Boolean, since it has two possible values: favorite or not favorite.
We can view spreadsheets in the same way. The records in a spreadsheet might be about all sorts of things: clients, products, invoices, or anything else. Each record has several fields, which reveal more about the clients, products, or invoices. The value in every cell contains a specific piece of data, like the address of a client or the dollar amount of an invoice.
As a data analyst, lots of data will come your way, and records, fields, and values in data tables will help you navigate analysis. Understanding the structures of the tables you’re working with is a part of that. And hopefully, while you’re working hard on your analysis and those tables, you can have a little fun with a different data table: the one with your favorite playlist!
When discussing structured databases, data analysts refer to the data contained in a row as a record. How do they refer to the data contained in a column?
Field
Data analysts refer to the data contained in a column as a field.
Practice Quiz: Hands-On Activity: Applying a function
Video: Meet wide and long data
- Wide data is a data format in which each row represents a single data subject, and each column represents a single attribute of that subject.
- Long data is a data format in which each row represents a single data point, and each column represents a single variable.
- Wide data makes it easier to identify and compare different columns.
- Long data makes it easier to store and analyze multiple variables for each subject at each time point.
- The best data format to use depends on the specific needs of the data analysis task.
Meet wide and long data in data analytics
Wide and long data are two common data formats used in data analytics. Each format has its own advantages and disadvantages, and the best format to use depends on the specific needs of the analysis.
Wide data
Wide data is a data format in which each row represents a single data subject and each column represents a different attribute of that subject. For example, a wide dataset about the population of Latin and Caribbean countries might have one row for each country and one column for each year.
Long data
Long data is a data format in which each row represents a single data point in time for a single subject, and each column represents a different variable. For example, a long dataset about the population of Latin and Caribbean countries might have one row for each country-year pair, with columns for the country, year, population, and other variables such as the average age of the population.
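To make the contrast concrete, here is a minimal sketch of the two formats as Python dictionaries (the country names are real, but the population figures are invented for illustration):

```python
# Wide format: one row per country, one column per year.
# (The population figures here are made up.)
wide = [
    {"country": "Aruba",  "2010": 101, "2011": 102},
    {"country": "Brazil", "2010": 195, "2011": 197},
]

# Long format: one row per country-year pair.
long_rows = [
    {"country": "Aruba",  "year": 2010, "population": 101},
    {"country": "Aruba",  "year": 2011, "population": 102},
    {"country": "Brazil", "year": 2010, "population": 195},
    {"country": "Brazil", "year": 2011, "population": 197},
]

# Adding a new variable such as average age needs only one more key per
# long row, while the wide format would need a new column for each year.
```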
Advantages and disadvantages of wide and long data
Wide data is good for identifying and comparing different columns. For example, in the wide dataset about the population of Latin and Caribbean countries, it is easy to compare the annual populations of different countries or the populations of the same country at different points in time.
Long data is good for storing and organizing data when there are multiple variables for each subject at each time point. For example, in the long dataset about the population of Latin and Caribbean countries, it is easy to store and analyze data on the population, average age, and other variables for each country-year pair.
Transforming between wide and long data
It is often necessary to transform wide data into long data, or vice versa, depending on the needs of the analysis. For example, if you want to analyze the relationship between the population and average age of Latin and Caribbean countries over time, you would need to transform the wide data into long data.
There are a variety of tools and resources available to help you transform wide and long data. For example, many spreadsheet programs have built-in functions for transforming data formats.
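As one illustrative sketch of such a transformation in plain Python (the helper function and figures are invented for the example; spreadsheet pivot features and pandas' `melt` function do the equivalent):

```python
def wide_to_long(rows, id_field, value_name):
    """Turn one row per subject (wide) into one row per subject-year (long)."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_field:
                long_rows.append({id_field: row[id_field],
                                  "year": int(key),
                                  value_name: value})
    return long_rows

# Made-up wide data: one column per year.
wide = [
    {"country": "Aruba",  "2010": 101, "2011": 102},
    {"country": "Brazil", "2010": 195, "2011": 197},
]

long_data = wide_to_long(wide, "country", "population")
# long_data has four rows, one per country-year pair.
```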
When to use wide and long data
The best data format to use depends on the specific needs of the analysis. If you need to identify and compare different columns, then wide data is a good choice. If you need to store and organize data when there are multiple variables for each subject at each time point, then long data is a good choice.
Here are some examples of when to use wide and long data:
- Wide data:
- Comparing the sales of different products in different regions
- Analyzing the performance of different marketing campaigns
- Identifying the characteristics of different customer segments
- Long data:
- Tracking the customer journey over time
- Analyzing the impact of different interventions on patient outcomes
- Forecasting future sales or demand
By understanding the advantages and disadvantages of wide and long data, you can choose the right format for your analysis.
You probably use the words “wide” and “long” all the time. You might use “wide” to describe the size of something from side to side, like a wide river. But a river can also travel great distances, so you might call it “long” as well. Wait! Before you stop the video, I promise you didn’t accidentally click in the wrong course. I’m not here to teach you words you already know. But the words “wide” and “long” can be used to describe data, too. So I am here to help you understand wide data and long data.
So far you’ve dealt with data arranged mostly in a wide format. With wide data, every data subject has a single row with multiple columns to hold the values of various attributes of the subject. Here’s some wide data in a spreadsheet. You might remember we discussed this data about the population of Latin and Caribbean countries earlier. For this data set, each row provides all of the population information about one country. Each column shows the population for a different year.
Wide data lets you easily identify and quickly compare different columns. In our example, the data is arranged alphabetically by country, so you can compare the annual populations of Antigua and Barbuda, Aruba, and the Bahamas by just checking out the values in each column. The wide data format also makes it easy to find and compare the countries’ populations at different periods of time. For example, by sorting the data, we discover that Brazil had the highest population of all countries in 2010, and the British Virgin Islands had the lowest population of all countries in 2013.
Okay, now let’s explore this data in a long format. Here the data is no longer organized into columns by year. All the years are now in one column, with each country, like Argentina, appearing in multiple rows, one for each year of data. This is how long data usually looks. Long data is data in which each row is one time point per subject, so each subject will have data in multiple rows. Our spreadsheet is formatted to show each year of population data. Here we see Antigua and Barbuda first.
Long data is a great format for storing and organizing data when there are multiple variables for each subject at each time point that we want to observe. With this long data format, we can store and analyze all of this data using fewer columns. Plus, if we added a new variable, like the average age of a population, we’d only need one more column. If we’d used a wide data format instead, we would have needed 10 more columns, one for each year. The long data format keeps everything nice and compact.
If you’re wondering which format you should use, the simple answer is, “it depends.” Sometimes you’ll have to transform wide data into a long data format, and other times vice versa. You’ll probably work with both formats in your job, and you’ll definitely revisit both formats again later in this program.
That reminds me: earlier we defined data as a collection of facts. As you’ve discovered over the last few videos, that collection of facts can take on lots of different formats, structures, types, and more. Learning about all of the ways that data can be presented will be a big help to you throughout the data analysis process. The more you work with data in all its forms, the quicker you’ll start to recognize which data to use, and when to use it. And in just a bit, you’ll use all that data stored in your brain to help you take an assessment. After that, you’ll learn how to identify and avoid bias in data and how to embrace credibility, integrity, and ethics. The data adventure moves forward. I’m so glad you’re moving with it!
Reading: Transforming data
Practice Quiz: Hands-on Activity: Introduction to Kaggle
Practice Quiz: Test your knowledge on data types, fields, and values
Fill in the blank: Internet search engines are an everyday example of how Boolean operators are used. The Boolean operator _____ expands the number of results when used in a keyword search.
OR
The Boolean operator OR expands the number of results when used in a keyword search.
Which of the following statements accurately describes a key difference between wide and long data?
Wide data subjects can have data in multiple columns. Long data subjects can have multiple rows that hold the values of subject attributes.
What does data transformation enable data analysts to accomplish?
Change the structure of the data
Data transformation enables data analysts to change the structure of data.
Weekly challenge 1
Reading: Glossary: Terms and definitions
Quiz: *Weekly challenge 1*
What is the most likely reason that a data analyst would use historical data instead of gathering new data?
If the project has a very short time frame.
Which of the following are examples of discrete data? Select all that apply.
Box office returns
Movie budget
Number of actors in movie
Which of the following questions collect nominal qualitative data? Select all that apply.
Did anyone recommend our restaurant to you today?
Is this your first time dining at this restaurant?
Have you heard of our frequent diner program?
Which of the following is a benefit of internal data?
Internal data is more reliable and easier to collect.
A data analyst is reviewing data that has been organized into a table format. What type of data is in the table?
Structured data
A Boolean data type must have a numeric value.
False
In wide data, what do the columns contain?
The data variables
What is the term for changing the structure or format of data?
Data transformation