Module 1: Introduction to statistics

You’ll explore the role of statistics in data science and identify the difference between descriptive and inferential statistics. You’ll learn how descriptive statistics can help you quickly summarize a dataset and measure the center, spread, and relative position of data.

Learning Objectives

Use Python to compute descriptive statistics
Determine measures of relative position such as percentile, quartile, and interquartile range
Determine measures of dispersion such as range, variance, and standard deviation
Determine measures of central tendency such as mean, median, and mode
Explain the relationship between parameter and statistic in inferential statistics
Explain the relationship between population and sample in inferential statistics
Explain the difference between descriptive statistics and inferential statistics

Table Of Contents

Get started with the course
Descriptive statistics
Key takeaways
Resources for more information
Calculate statistics with Python
Review: Introduction to statistics

Get started with the course

Video: Introduction to Course 4

Notes

Transcript

This course introduces you to statistics for data analysis, a powerful tool for making informed decisions from data.

Who is this course for?

Beginners with no prior knowledge of statistics.
Individuals with some statistics experience looking to apply it to data analytics.

What will you learn?

Difference between descriptive and inferential statistics.
Descriptive statistics: mean, median, standard deviation.
Inferential statistics: drawing conclusions and making predictions.
Probability and types of probability distributions.
Sampling methods and sampling distributions.
Confidence intervals and their interpretation.
Hypothesis testing and interpreting test results.
Applying statistics in an A/B testing portfolio project.

Benefits of learning statistics:

Make informed business decisions and solve complex problems.
Stand out in data career interviews with practical skills.
Build a foundation for advanced data analysis methods.

Instructor: Evan, an economist who uses statistics to help businesses make data-driven decisions.

Ready to begin? Enroll and start unlocking the power of statistics for your data analysis journey!

Hey there, welcome to the next stage of your journey
to learn advanced data analytics. First, congratulations on your progress. You’ve learned how data professionals
contribute to the success of an organization and the main tools and
techniques they use on the job. You’re now familiar with the basic syntax
and functions of the Python programming language, and you know how to use code for
exploratory data analysis. You can use data wrangling to organize and
clean your data, and create data visualizations to
share important information. Well done. You already have quite a few tools in
your analytic toolbox. Have room for a few more? Next up, statistics. Statistics is the study of the collection
analysis and interpretation of data. Statistics often abbreviated as stats,
provides data professionals with powerful tools and methods for
transforming data into useful knowledge. You’ve already learned about
exploratory data analysis and how it helps you summarize the main
characteristics of your data. Descriptive statistics does this too,
and that’s where we’ll start. But data professionals also use
statistics to do something more, based on a small sample of available data,
they can make informed predictions about uncertain events and make accurate
estimates about unknown values. This is known as inferential statistics,
and You’ll learn all about it in this course. For example, data professionals use statistics
to predict future sales revenue. The success of a new ad campaign, the rate
of return on a financial investment, or the number of downloads for a new app. Statistical analysis can tell you which
version of a website will attract more new customers for longer periods of time. Or, that new users will typically create
an account after spending three minutes on the company’s website. The insights gained from statistical
analysis help business leaders make decisions, solve complex problems and improve the performance of
the products and services. This is why data professionals
are in such high demand and why the data career space keeps growing. Speaking of data professionals,
allow me to introduce myself. My name is Evan, I’m an economist and I
consult with various teams across Google. This means that I use statistics and
other tools to analyze and interpret data to help business
leaders make informed decisions. This includes helping them
quantify uncertainty and identify if there is sufficient
evidence to reject a hypothesis. Both of which you’ll learn
more about later on. I’m thrilled to be your
instructor in this course. Before we begin, let me tell you about
my own experience with statistics. As an undergraduate,
I majored in economics and mathematics, and then went on to get
a PhD in economics. I focused on statistics and econometrics, a branch of economics that uses
statistics to analyze economic problems. During my graduate studies, I interned
at an online learning company and was a researcher at
an online retail company. Across these roles and experiences, I’ve used many different statistical
tools to solve problems. Often I find that the problem I’m working
on can be solved with a statistical method I’m unfamiliar with. I love constantly learning new methods and extending the range of
problems I can work on. These advanced methods are built
on a foundation of stats concepts. In this course, we’ll focus on these
fundamentals to prepare you for your future career. So if you’re new to stats, welcome. This course does not assume you have
any prior knowledge of statistics. We’ll begin from the beginning and
work through each concept step by step. But if you have some experience
in statistics, that’s great too. We’ll help you use what you
already know in a new way so you can apply your stats knowledge
to data analytics specifically. In this course, you’ll discover how data professionals use
statistical tools in their daily work. You’ll also learn strategies for
interpreting findings and sharing them with stakeholders who may
not be familiar with stats concepts or all the technical details. We’ll start this course with an
introduction to the role of stats in data analytics, and we’ll discuss
the differences between descriptive and inferential stats. You’ll learn how descriptive
stats such as mean, median, and standard deviation, help you quickly
summarize and better understand your data. Then we’ll explore how to use inferential
statistics to draw conclusions and make predictions about data. Next we’ll explore probability and discover useful ways to
measure uncertainty. We’ll discuss the basic rules of
probability and how to interpret different types of probability distributions
such as the normal, binomial, and Poisson distributions. From there, we’ll move on to sampling. We’ll discuss what makes a good sample, the benefits and drawbacks of
different sampling methods and how to work with sampling distributions.
We’ll also examine confidence intervals which describe the uncertainty
in an estimate. You learn how to construct different
kinds of confidence intervals and interpret their meaning. After that we’ll explore how to use
hypothesis testing to compare and evaluate competing claims about your data. We’ll go over the steps for applying
different tests to specific data sets and will demonstrate how to
interpret test results. Finally, you’ll get a chance to
apply your stats knowledge in your next portfolio project. The portfolio project features
a scenario based on a/b testing, an important practical
application of statistics. In future job interviews. You can share your project as
a demonstration of your skills and impress potential employers. I’ll be your guide every step of
the way and remember you set the pace. Feel free to go over the videos
as many times as you like and review topics that are new to you. By the end of the course you’ll have a
useful toolkit of stats concepts to carry with you on the rest of
your learning journey and in your future career let’s get started

Video: Evan: Engage and connect

Notes

Transcript

Evan, an economist at Google, shares his journey into data analytics. Starting with lukewarm interest in math, he discovered the power of economics to analyze the world. His daily work involves partnering with businesses to solve problems through brainstorming, data analysis, and presenting insights. He emphasizes the importance of communication, mastering fundamentals, and networking with experienced professionals to accelerate learning in this field.

Hi, I’m Evan. I’m an
economist at Google. When I was in high school, I was pretty good at math but wasn’t overly interested in it. When I got to college, I took an economics course and was very interested in using
that framework to view the world
and solve problems. A typical day on
the job involves me working with
business leaders to understand their
problems and then help them brainstorm solutions
to their problems. Sometimes this is
just talking and consulting on problems and helping them figure
out solutions, or other times I’ll
go in and I’ll gather data on my own from the company and I’ll perform analyses and solve problems and help them identify interesting measures or results that could inform the decisions and
solve the problems. I also like to devote time each day to doing my
own research on topics that I’m unfamiliar
with so that I’m always growing my skills and
increasing my toolkit. Some of the soft
skills that were most important in
starting a career in data analytics was first being able to present
your results. You can do tons of
work in getting data, mining through the data, trying to find some
interesting pieces of information and you
may find something. But then being able to clearly communicate that to
other people who may not be experts
in what you’re studying is actually
quite difficult. Make sure you master the basics and don’t
try to go too fast. If there’s something
you don’t understand, re-watch it, do the reading, make sure you know
the fundamentals. It all builds on itself. If I can give myself some
advice when I was starting my first data analytics
role is to take time to meet other people in the field and to network with them and
learn what they know. A lot of data professionals in this field have
built up lots of knowledge that’s quite useful
specific to their company, specific to their roles, specific to certain
types of problems, and that knowledge lives
inside their head. It’s not written in the book, it’s not in a manual, but
they have this knowledge. The more you talk with
people than where you meet people and talk about these different problems
that you have, the faster you can grow and
the faster you can learn. Instead of having to learn
these things on your own and solve the
problems bit by bit, you can work with
other people and they can just help you
to solve these so much quicker because
they’ve already solved the problem and learned
that information.

VIdeo: Welcome to module 1

Notes

Transcript

Dive into Statistics!

This course covers how data professionals utilize statistics to glean insights and help organizations solve challenges.

What you’ll learn:

Statistics fundamentals: Discover its role in data-driven work and the importance of key concepts.
Real-world application: Witness an A/B test showcasing statistics in action.
Descriptive vs. Inferential: Understand the two main types and their uses.
Exploring data: Learn about measures of central tendency, dispersion, and position.
Python for statistics: Harness Python to calculate descriptive statistics and summarize data.

Ready to begin your statistical journey? Join the next video!

Hi there. I can’t wait to explore statistics with
you during our journey together you’ll learn how data professionals use
statistics to gain insights from data and help organizations solve complex problems. Statistics is the study of the collection,
analysis and interpretation of data. We’ll start off by discussing the
foundational role of statistics in data driven work and the importance
of fundamental stats concepts. Next you’ll have an opportunity
to observe stats in action. We’ll explore an example of how data
professionals use statistical methods to conduct an A/B test. Then we’ll discuss the two main types of
statistics, descriptive and inferential. Data professionals use descriptive
stats to explore and summarize data. They use inferential stats
to draw conclusions and make predictions about data. Next w’ll consider three different
types of descriptive statistics and how they can help you better understand
different aspects of data. Measures of central tendency such as the mean allow you to describe the center of a data
set measures of dispersion like standard deviation. Let you describe the spread of data
measures of position such as percentiles, help you determine the relative
position of the values in the dataset. Finally you learn how to use python
to compute descriptive statistics and summarize a data set. When you’re ready,
join me in the next video.

Video: The role of statistics in data science

Notes

Tutorial

Transcript

Statistics: The Language of Data Science

This video delves into the crucial role of statistics in data analysis and highlights the importance of mastering its fundamentals.

Key points:

Data explosion: The massive generation of data necessitates effective analysis and interpretation, making statistics crucial.
Data science boom: The field thrives alongside the ever-growing data, with statistics underpinning its methods.
Applications across domains: Statistics empowers data professionals in business, medicine, science, and more.
Benefits of statistical understanding: Identify patterns, measure uncertainty, extract insights, predict future trends, and solve complex problems.
Real-life examples: Weather forecasts, sports stats, election polls, medical percentiles – all utilize statistical concepts.
Data professionals leverage statistics: Predict investment returns, estimate sales revenue, assess survey errors, and rank housing prices.
Statistics as the foundation: It forms the cornerstone of data analytics and underpins even the most advanced methods.
Learning fundamentals: The grammar of data analysis: Just like grammatical knowledge enables language use, basic statistics empower data professionals to communicate effectively.
Building upon the foundation: Mastering fundamental concepts leads to advanced topics like hypothesis testing, classification, regression, and time series analysis.

Get excited! The next video offers a glimpse into how statistics are applied in practice.

Statistics: The Language of Data Science – A Beginner’s Tutorial

Welcome to the incredible world of statistics, the essential language for understanding and using data science! This tutorial will guide you through the fundamentals, equipping you to unlock the secrets hidden within data.

Part 1: Why Statistics Matter

Data Deluge: We generate more data than ever before, from online interactions to scientific experiments. Statistics helps us make sense of this vast information and extract valuable insights.
Data Science Boom: This rapidly growing field relies heavily on statistics, making it a powerful skill for various careers.
Impact Across Domains: Whether in business, medicine, science, or government, statistics helps professionals solve complex problems and make informed decisions.

Part 2: Unveiling the Basics

Data Wrangling: We’ll start by learning how to collect, organize, and visualize data using tools like spreadsheets and basic programming.
Descriptive Statistics: Explore measures like mean, median, and standard deviation to understand central tendencies and data spread.
Probability and Inference: Dive into the logic of chance and uncertainty, making predictions and drawing conclusions from data samples.

Part 3: Putting Stats into Action

Real-World Examples: Witness how statistics are applied in different scenarios, from analyzing election polls to evaluating the effectiveness of treatments.
Statistical Software: Learn about powerful tools like Python and R to perform complex statistical analyses with ease.
Case Studies: Analyze real datasets and practice solving problems to solidify your understanding and gain practical experience.

Part 4: Advanced Exploration (Optional)

Hypothesis Testing: Learn how to formulate and test hypotheses to draw more robust conclusions from data.
Regression Analysis: Explore techniques to model relationships between variables and predict future outcomes.
Time Series Analysis: Unlock the secrets of data with a time dimension, like stock prices or weather patterns.

Remember:

Statistics is a journey, not a destination. Start with the basics and gradually build your knowledge through practice and exploration.
There are tons of resources available online, including tutorials, courses, and communities. Don’t hesitate to seek help and connect with other aspiring data scientists.
Most importantly, have fun! Discovering the power of statistics opens doors to endless possibilities in the fascinating world of data.

Additional Tips:

Engage with interactive learning platforms and gamified experiences to make statistics more enjoyable.
Participate in online forums and communities to connect with mentors and peers for support and guidance.
Find real-world datasets that interest you and apply your statistical skills to answer your own questions.

This is just a starting point. With dedication and curiosity, you’ll soon be fluent in the language of statistics and ready to unlock the hidden potential of data!

Earlier, you learned that statistics is the study
of the collection, analysis and
interpretation of data. Today, humans generate and collect more data
than ever before. Whenever we send a text message, make an purchase online, or post a photo on social media, we generate new data. As the amount of data grows, so does the need to
analyze and interpret it. This is a big reason why stats and data-driven work
is so important. and the field of
data analytics is growing almost as fast
as the data itself. Data professionals
use statistics to analyze data in business, medicine, science, engineering,
government, and more. In this video, we’ll discuss the role of statistics
in data science and why learning fundamental
stats concepts is essential for every
data professional. Data professionals use the
power statistical methods to identify meaningful patterns
and relationships in data, analyze and quantify uncertainty, generate insights from data, make informed predictions about the future and solve
complex problems. Even if you’ve never
studied statistics, you probably use stats daily. For example, you
may start your day by going online and
checking the weather, where you learn that
the forecast is for a 70 percent chance of rain, or a 50 percent chance of snow. Perhaps you visit
a sports website to learn the batting average of your favorite cricket player or the scoring average of your
favorite basketball player. On a news app, you might come across
an election poll that reports a three
percent margin of error and notes that an online survey was used
to collect the data. Or perhaps you’re a parent, and when you take your child
to their yearly checkup, you learn that your child is in a certain percentile
for height and weight. When you ask for
more information, the doctor shows you
the median height and weight for all kids
who are the same age. These scenarios include
statistical concepts that you’ll learn more
about in this course. The weather report is based on probability or the
likelihood of an event. The sports stats
express average value. The election poll
shows margin of error. The doctor uses the concept
of percentile and median. All these stats give you useful knowledge that you
can apply to your own life. Data professionals use the
same concept in their work. For example, a data
professional might use probability to predict
the future rate of return on an investment. They might estimate the annual average sales revenue
for a company, calculate the margin
of error to quantify the uncertainty of an
employee satisfaction survey, or use percentiles to rank median home prices
in different cities. On the job, data
professionals use stats to transform data into insights that help stake holders
make decisions. Statistics is the foundation
of data analytics and is the basis for the
most advanced methods of analysis that data
professionals use. It all begins with the
fundamental concepts that we’re exploring
in this course. Consider the role
of grammar plays in your conversations, for example, when you chat with
friends or coworkers, you’re probably
not thinking about grammatical concepts like
the parts of speech. If you’re having a conversation, then you already know
how to use nouns, verbs, and adjectives. Knowledge of basic
grammar makes it possible to use language
in the first place. This is why it’s
so foundational. In a similar way,
shared knowledge of basic statistics allows
data professionals to use a common language. Learning the basics will
eventually let you join the conversation about
more advanced topics. You’ll build on your
foundation in statistics with more complex methods
like hypothesis testing, classification, regression,
and time series analysis. I hope you’re getting excited about how data professionals use stats to make sense of their data and gain
useful knowledge. Coming up, you’ll get
a chance to check out an example of
stats in action.

Video: Statistics in action: A/B testing

Notes

Tutorial

Transcript

Summary of A/B testing and statistics:

This video explains the importance of statistics in understanding and interpreting A/B tests, a popular method for comparing two versions of something (e.g., webpages, ads) to see which performs better.

Key Points:

Data-driven decisions: Businesses use A/B testing to optimize products and improve customer experience based on data, not intuition.
What is A/B testing? It compares two versions (A and B) of something to a randomly selected group of users. The version with better results (e.g., higher conversion rate) is chosen.
Statistics in A/B testing:
- Sampling: Analyze a small group (sample) to draw conclusions about the entire population (website visitors).
- Sample size: Choosing the right sample size is crucial for accurate results. Statistics help determine the optimal size.
- Confidence intervals: Quantify the uncertainty surrounding an estimate (e.g., conversion rate). Helps interpret results with awareness of potential error.
- Statistical significance: Determine if the observed difference between versions is due to chance or a real effect. Hypothesis tests provide data-driven support for decisions.
Benefits of learning statistics:
- Design, conduct, and interpret A/B tests effectively.
- Build a foundation for more advanced data analytics methods.
- Make data-driven decisions in various contexts.

This course will teach you the relevant statistical concepts and apply them to analyze and interpret A/B test results, ultimately building your data analysis skills.

Tutorial: Demystifying A/B Testing with Statistics

Welcome to this tutorial where we’ll explore the exciting world of A/B testing, powered by the insights of statistics!

What is A/B testing?

Imagine you’re a website owner wanting to increase sales. You have two website designs, A and B, but don’t know which one converts better. Enter A/B testing! It involves:

Creating two versions: Version A (original) and Version B (modified).
Showing each version to random groups: Visitors are randomly assigned to see either A or B.
Analyzing results: You compare conversion rates (e.g., purchases) between versions to see which performs better.

Statistics: The secret sauce

Raw data from A/B tests reveals little. Here’s where statistics become crucial:

1. Sampling:

Not everyone sees both versions. Statistics help determine the ideal sample size (number of visitors) for accurate results.

2. Confidence intervals:

The conversion rate you observe might not reflect the entire population. Statistics help create a “confidence interval,” a range of likely values for the true conversion rate, considering the uncertainty based on your sample.

3. Statistical significance:

Is the observed difference due to chance or the actual change (e.g., larger button in B)? Statistics provide “hypothesis tests” to assess if the difference is statistically significant, meaning it likely reflects a real effect, not random noise.

Tools in your statistical belt:

Sample size calculators: Determine the optimal number of users for your test.
Confidence interval calculators: Estimate the true conversion rate range with uncertainty accounted for.
Hypothesis testing tools: Assess if the observed difference is statistically significant.

Benefits of statistical prowess:

Make data-driven decisions: Don’t rely on gut feeling. Use statistics to confidently choose the winning version based on evidence.
Avoid costly mistakes: Implementing a less effective version due to chance variations can be expensive. Statistics help you make informed choices.
Sharpen your data analysis skills: A/B testing serves as a springboard for more advanced data analysis techniques.

Ready to put theory into practice?

Choose your test subject: Maybe a website landing page, email subject line, or app feature.
Define your goal: Increase conversions, engagement, or another metric.
Develop your hypotheses: What outcome do you expect for each version?
Design your versions: Make clear, measurable modifications to one version.
Set your sample size: Use online calculators based on your expected effect size and desired confidence level.
Run the test: Use A/B testing tools or build your own with randomization.
Collect and analyze data: Track the chosen metric for each version.
Interpret results: Calculate confidence intervals and conduct hypothesis tests.
Draw conclusions: Was the difference statistically significant? Which version wins?
Take action: Implement the winning version or iterate based on learnings.

Remember, statistics are your allies in navigating the world of A/B testing. By understanding these concepts and applying them effectively, you can make data-driven decisions that optimize your products, websites, and marketing campaigns, ultimately leading to success!

Today’s economy is
all about data. Business leaders want to make data-driven decisions based
on evidence and analysis. Companies that use insights
gained from data to guide their decision-making
process are more likely to be successful than
companies that don’t, and data professionals are the people that generate
those insights. They use statistics to
transform data into knowledge, and help stakeholders
make informed decisions. All the fundamental stats
concepts that we cover in this course have valuable
practical applications. In this video, you’ll get a chance to check
out stats in action. We’ll go over one of the
most popular applications of statistics for
business, A/B testing. I’ll discuss how
the stats concepts you’ll learn in this course can help you analyze, and interpret
data using an A/B test. Companies use A/B testing to evaluate everything from website design, to mobile apps, to online ads, to
marketing emails. A/B testing is a way to compare two versions of
something to find out which version
performs better. A/B testing has become popular because it works well for
many online applications. For example, businesses often
use A/B testing to create two versions of a webpage
to find out which one gets more clicks, purchases,
or subscriptions. Even small changes to a webpage, like changing the color, size, or location of a button can increase
financial gains. A/B test help business
leaders optimized product performance and
improve customer experience. Another way companies use A/B testing is for
marketing emails. You might send two
versions of an email to your customer list to find
out which version results in more sales, or you might
test two versions of an online ad to discover which one visitors click
on more often. Once you’ve conducted
at the A/B test, you can use the data to make permanent changes to your ad. Let’s go through an example
of an A/B test step by step. Imagine you run an
online store and 10 percent of visitors to
your website make a purchase. You want to run an
A/B test to find out if changing the size of
the add to cart button will increase the
conversion rate, or the percentage of customers
who purchase a product. The test presents two
versions of your webpage, known as version
A and version B, to a group of randomly
selected users. Version A is the
original webpage. Version B is the webpage with the larger add
to cart button. The test directs
half the users to version A and half to version B. The test runs for two weeks. When the test is over,
a statistical analysis of the results indicates that the larger
button in version B resulted in an
increase in purchases. The conversion rate for
version B is 30 percent. This is three times greater than the conversion
rate for version A, which is 10 percent. That’s a notable increase. Because of your A/B test, your company has a data-driven
reason for replacing the current webpage with version B and increasing the size of an add
to cart button. Now that you know how
an A/B test works, let’s explore the stats
concepts behind A/B testing. Later on, we’ll cover each
concept in more detail. Think of this list as a brief preview of your
future stats knowledge. The A/B test analyzes a small
group of users drawn from the total population of all
users that visit the website. In stats, we call this
smaller group the sample. The sample is a subset of
the larger population. You can use data from a
sample to make inferences or draw conclusions about
the entire population. Data professionals use
inferential statistics to make inferences about a dataset
based on a sample of the data. In other words, stats are a
powerful tool for predicting outcomes you don’t know
using data you do know. For example, you have no
way of knowing how the next 100,000 website
visitors will behave. What you can do is observe the next 1,000 visitors, and then use inferential
statistics to predict how the following
99,000 will behave. As you’ll discover,
stats can help you make that prediction
with accuracy. This is why observing
a sample through A/B testing can be so
valuable to companies. They can use the results
of the test to make changes that improve
their business. Sampling, the process of
selecting a subset of data from a population, is a
critical part of an A/B test. Before you conduct the test, you need to decide
on the sample size or the number of
users in the test. Choosing the right sample
size helps you get valid test results and
avoid statistical errors. For example, you’ll use stats to help you determine
whether you need to use a sample size
of 1,000 or 10,000 in order to predict customer
behavior accurately. Like any statistical test, an A/B test can’t predict user behavior with 100
percent certainty. What stats can do is construct
a confidence interval or a range of values that describes the uncertainty or
surrounding an estimate. Knowing how to construct and interpret a confidence
interval helps you make informed decisions about all users based on
your test sample. Using stats, you can
quantify the uncertainty of your A/B test and share this information
with stakeholders to help them interpret
the results. We’ll talk more about
how to interpret a confidence interval later on. After the test is complete, you’ll need to determine
the statistical significance of your results. Statistical
significance refers to the claim that the results of a test or experiment are not
explainable by chance alone. For instance, is the difference between version A
and version B due to random chance or due to the fact that you changed
the add to cart button? A hypothesis test is a statistical method that helps
you answer this question. The test helps quantify whether
the result is likely due to chance or if it’s
statistically significant. A hypothesis test gives you data-driven
support for changing your webpage to version B or for keeping it the
same with version A. Software can help you calculate
complex math problems. But having a working
knowledge of stats lets you properly design, conduct, and interpret the
results of a real test. By the end of this course, you’ll know how to use
all the stats concepts we just reviewed to analyze
and interpret data. In fact, you’ll be able to put your stats skills to work in a portfolio project based on a realistic A/B
testing scenario. Plus, your knowledge of stats will serve as
a foundation for more advanced data analytics methods that you’ll
explore later on.

Video: Descriptive statistics versus inferential statistics

Notes

Tutorial

Quiz

Transcript

Summary of Descriptive and Inferential Statistics

This video discusses two main types of statistical methods: descriptive and inferential.

Descriptive statistics:

Describe or summarize the main features of a dataset.
Useful for understanding large amounts of data quickly.
Examples include:
- Visualizations: Graphs, tables, etc.
- Summary statistics: Mean, median, standard deviation, etc.
Used to describe the center (mean) and spread (standard deviation) of data.

Inferential statistics:

Allow data professionals to make inferences about a population based on a sample.
A population is all possible elements you’re interested in, while a sample is a subset.
Used to draw conclusions about the larger population based on data from a smaller sample.
Important to have a representative sample that reflects the population accurately.
Examples include:
- Parameters: Characteristics of a population (e.g., average height of all giraffes).
- Statistics: Characteristics of a sample (e.g., average height of 100 giraffes).
Statistics are used to estimate unknown population parameters.

Key takeaways:

Both descriptive and inferential statistics are important for data analysis.
Descriptive statistics help understand the data itself, while inferential statistics help draw conclusions about the wider population.
Choosing a representative sample is crucial for inferential statistics to be reliable.

Tutorial: Unveiling Data Stories with Descriptive and Inferential Statistics

Welcome, data explorers! This tutorial delves into the fascinating world of statistics, where we transform numbers into meaningful stories. We’ll uncover two key approaches: descriptive statistics to understand data itself and inferential statistics to draw conclusions about larger populations based on samples.

Part 1: Painting a Portrait with Descriptive Statistics

Imagine you have a treasure chest filled with coins from different countries. Descriptive statistics are like opening the chest and examining the coins:

1. Visualizations:

Histograms: See how many coins fall within certain value ranges (e.g., how many are worth 1 unit, 5 units, etc.).
Scatter plots: Explore relationships between variables (e.g., coin value vs. year of minting).
Box plots: Summarize the spread of data (e.g., identify outliers, compare value distributions).

2. Summary Statistics:

Mean (average): The typical value (e.g., average value of all coins).
Median: The “middle” value when all coins are arranged in order.
Mode: The most frequent value (e.g., the most common coin value).
Standard deviation: Measures how spread out the data is (e.g., how much coins deviate from the average).

Part 2: Making Predictions with Inferential Statistics

Now, let’s say you want to predict the value of a hidden coin based on the ones you’ve seen. Inferential statistics help you make informed guesses:

1. Samples and Populations:

Population: All possible coins (think of the entire coin collection).
Sample: A subset of coins you actually examine (like the coins you took out).

2. Drawing Inferences:

We can’t examine every coin, so we use a sample to make inferences about the whole population.
Example: We calculate the average value of the sample coins and use it to estimate the average value of all coins (population parameter).

3. Confidence Intervals:

This tells us how confident we are in our estimate. Imagine a range around the sample average; the wider the range, the less certain we are about the population average.

4. Hypothesis Testing:

We can test specific claims about the population (e.g., “all coins are fair”).
We collect data, analyze it statistically, and decide if the claim is likely true or not.

Remember:

Sample size matters: Larger samples lead to more reliable inferences.
Random sampling: Your sample should be representative of the population to avoid bias.

Bonus:

Explore statistical software like R, Python, or Excel to analyze your data and create visualizations.

By mastering these techniques, you’ll transform from a data observer to a data storyteller, uncovering hidden patterns and making informed decisions!

Further Exploration:

Check out online resources and tutorials for deeper dives into specific statistical methods.
Practice with real-world datasets to apply your newfound knowledge.
Remember, statistics is a journey, not a destination. Keep learning and exploring the power of data!

Descriptive statistics enable data professionals to summarize the main features of a dataset.

True

Descriptive statistics enable data professionals to summarize the main features of a dataset. They also describe the dataset so people can quickly understand large amounts of data.

Now that you know more
about the role of statistics and data science, let’s discuss the
two main types of statistical methods,
descriptive and inferential. Data professionals
use each method to get different insights
from their data. In this video, you’ll
learn the difference between descriptive
and inferential stats and how data
professionals use both to better understand their data. Descriptive statistics
describe or summarize the main
features of a dataset. Descriptive stats are very
useful because they let you quickly understand
a large amount of data. For example, let’s
say you had data on the heights of
10 million people. You probably don’t want to scan 10 million rows of data to
analyze it for your report. Even if you did, it would be difficult to interpret the data. However, if you
summarize the data, you can instantly
make it meaningful. Finding the mean
or average height gives you useful
knowledge about the data plus reading a summary is much better than staring at
millions of rows of data. There are two common forms
of descriptive statistics, visuals like graphs and
tables, and summary stats. Previously you learned
how graphs and tables can help you explore, visualize, and share your data. You’re likely familiar with data visualizations
such as histograms, scatter plots, and box plots. Summary statistics let you summarize your data
using a single number. A common example is the
mean or average value. There are two main
types of summary stats, measures of central tendency
and measures of dispersion. Measures of central
tendency like the mean, let you describe the
center of your dataset, measures of dispersion
like standard deviation, let you describe the spread of your dataset or the amount of variation in your data points. Stats like mean and
standard deviation are used to describe
and summarize data. But data professionals do more than just described their data. They also draw conclusions and make predictions
based on data. For this, they use
inferential statistics. Inferential statistics allow
data professionals to make inferences about a dataset
based on a sample of the data. The dataset that the sample is drawn from is
called the population. The population includes
every possible element that you are interested
in measuring. As we’ve discussed, a sample
is a subset of a population. Data professionals
use samples to make inferences
about populations. In other words,
they use the data they collect from
a small part of the population to
draw conclusions about the population as a whole. Note that a statistical
population may refer to people,
objects, or events. For instance, a
population might be the set of all
residents of a country, all the planets in
our solar system, or all the outcomes
of 1,000 coin flips. A sample is a smaller group or subset of any of
these populations. Samples might be residents, planets, or coin flip outcomes. Let’s check out an example. Say you want to research
the music preferences of every college student in the United States to find
out whether they prefer pop, rap, country, classical,
or another genre of music. There around 20 million
college students in the United States
and would be too expensive and time-consuming to gather data from
every single person. Instead, you can
use a sample and survey only a subset of
the 20 million students. Later on, we’ll discuss the
factors that go into choosing different sample sizes and how larger sample sizes
affect your results. For now, let’s
imagine you decide to survey 1,000 students
instead of 20 million. Then you can use the results
to make inferences about the music preferences of
all college students. Keep in mind that
your sample should be representative of
your population. Otherwise, the
conclusions you draw from your sample will be unreliable
and possibly biased. A representative sample
is a sample that accurately reflects the
population, for example, if you only survey math majors
or only student athletes, your sample will not
be representative of all college students. Finally, let’s review two
terms that correspond to population and sample,
parameter and statistic. A parameter is a characteristic
of a population. A statistic is a
characteristic of a sample, for example, the
average height of the entire population of
giraffes is a parameter. The average height of
a random sample of 100 giraffes is a statistic. As I mentioned, it’s
difficult to collect data about every member of
a large population. In this case, to locate
and measure the height of every single
giraffe in the world. We use the known value
of a sample statistic, the average of 100 giraffes, to estimate the unknown value of a population parameter.
That’s all for now. We’ve covered a lot of
key concepts that are foundational for what you’ll learn later on in the course. Coming up, we’ll return to the topic of
inferential statistics. We’ll explore sampling in more detail and checkout
common methods of inferential stats such as confidence intervals
and hypothesis testing.

Practice Quiz: Test your knowledge: The role of statistics in data science

A data professional is analyzing real estate data. To estimate the mean rent of all the apartments in a large city, they calculate the mean rent of a random sample of 100 apartments. Which of the following best describes this statistical method?

Inferential statistics

This statistical method is inferential statistics, which makes inferences about a population based on a sample of the data.

The correct answer is Inferential statistics.

Explanation:

Descriptive statistics summarize the characteristics of a dataset. In this case, calculating the mean rent of the sample of 100 apartments is an example of descriptive statistics. It tells us the average rent within the sample itself.
Inferential statistics use sample data to make inferences about a population. In this case, the data professional is not just interested in the average rent of the 100 apartments they sampled, but they want to estimate the average rent for all apartments in the city (the population). This is where inferential statistics comes in. By calculating the mean rent of the sample and considering the sampling method (e.g., random sampling), we can make an estimate (e.g., with a confidence interval) of the population mean rent.

Therefore, inferential statistics is the most appropriate method for this scenario because it allows the data professional to draw conclusions about a larger population based on a sample.

Other options:

Data cleaning is the process of identifying and correcting errors in data. It is not relevant to the scenario described in the question.
A/B testing is a type of experiment where two different versions of something are compared to see which one performs better. It is not relevant to the scenario described in the question.

In statistics, a population can only include people.

False

In statistics, a population can include people, objects, events, or measurements.

The statement “In statistics, a population can only include people” is False. In statistics, a population can refer to any complete set of similar items or events that are of interest for a study or experiment. This can include:

People: This is certainly one common type of population, such as the population of a city, country, or the entire world.
Objects: This could be anything from manufactured items like cars or phones to natural objects like trees or stars. For example, you might study the population of all cars manufactured by a specific company in a certain year.
Events: This could be any event that can occur repeatedly, such as the outcome of a coin toss or the results of an election. You might study the population of all possible outcomes of a dice roll.

Therefore, it’s important to remember that populations can be diverse and encompass a wide range of things, not just people.

The mean weight of an entire population of elephants is an example of which of the following concepts?

Parameter

The mean weight of an entire population of elephants is an example of a parameter. A parameter is a characteristic of a population.

The correct answer is Parameter.

Here’s why:

Measure of dispersion: This describes how spread out the data is within a dataset. It wouldn’t apply to a single value like the mean weight of the entire population.
Data visualization: This is the graphical representation of data, not the data itself.
Statistic: This is a characteristic of a sample, calculated from a subset of data. The mean weight of the entire population isn’t derived from a sample, it’s the true value for the whole group.
Parameter: This is a characteristic of the entire population, measured from every individual in the population. Since the question mentions the “mean weight of an entire population” and not a sample, this is the most fitting option.

Therefore, the mean weight of an entire population of elephants is a parameter.

Descriptive statistics

Video: Measures of central tendency

Notes

Tutorial

Quiz

Transcript

Exploring Data: Measures of Central Tendency

This video dives into three key measures of central tendency: mean, median, and mode.

Key Points:

Understanding the Center: Just like exploring a city, analyzing data starts with understanding its center.
Measures of Central Tendency: These values represent the “center” of a dataset.
Mean: The average of all values, calculated by adding them up and dividing by the total number.
Median: The middle value when data is arranged from smallest to largest.
Mode: The most frequently occurring value.
Choosing the Right Measure:
- Mean: Ideal for normally distributed data without outliers.
- Median: Better for skewed data or datasets with outliers.
- Mode: Useful for categorical data to identify the most frequent category.
Application: Understanding the center helps you grasp the overall structure and guide further analysis.

Additional Notes:

Outliers can significantly impact the mean, making the median a more robust choice in such cases.
The mode is particularly helpful for understanding common categories in categorical data.

Demystifying Data: A Guide to Measures of Central Tendency

Welcome, data explorers! Today, we embark on a journey to understand the heart of your data: its central tendency. Just like knowing the city center helps navigate a new place, measures of central tendency unveil the core characteristics of your dataset.

The Power of the Center:

Imagine you’re analyzing exam scores. Knowing the “average” score wouldn’t tell you much. Measures of central tendency refine this by pinpointing the central value, where most data points cluster. This allows you to:

Grasp Overall Structure: Understand the “typical” value and spread of your data at a glance.
Make Comparisons: Compare central tendencies across different groups or datasets.
Inform Next Steps: Guide further analysis by identifying interesting patterns or areas for deeper investigation.

The Three Musketeers of Centrality:

Now, meet the three key players in this data-detective game:

Mean: The familiar average, calculated by adding all values and dividing by the total count. Think “balancing all exam scores on a scale.”
Median: The middle value when data is arranged in ascending order. Imagine “lining up exam scores from lowest to highest and picking the one in the middle.”
Mode: The most frequent value. Think “the exam score most students got.”

Choosing the Right Tool:

Each measure has its strengths:

Mean: Ideal for normally distributed data (think “bell curve”) without outliers.
Median: Robust against outliers and skewed data (think “exam scores with a few very high or low scores”).
Mode: Perfect for understanding common categories in categorical data (think “favorite movie genres”).

Real-World Examples:

Mean: Analyzing average house prices in a neighborhood (excluding outliers like mansions).
Median: Measuring typical income in a city with skewed wealth distribution.
Mode: Identifying the most popular clothing size in a clothing store’s inventory.

Remember:

Outliers can distort the mean, making the median a better choice.
The mode is not always present, especially with evenly distributed data.
Consider your data type, research question, and desired insights when selecting the best measure.

Beyond the Basics:

Explore how these measures relate to other concepts like standard deviation (spread of data).
Practice calculating and interpreting different measures using real-world datasets.
Remember, measures of central tendency are just the beginning. Dive deeper into data analysis to uncover hidden patterns and insights!

With this newfound knowledge, you’re equipped to navigate the heart of your data with confidence. Go forth, explore, and unlock the secrets hidden within!

Fill in the blank: The _____  is the average value in a dataset.

mean

The mean is the average value in a dataset.

Every time I explore new dataset,
I feel a sense of excitement. It’s like exploring a city for
the first time. When I visit a new city, I often start
my journey at the city’s center. This way, I can figure out
the distance between the center and the city limits or
to a famous landmark I want to visit. Knowing where I am in relation to
the center helps me find my way around. It’s the same when I want to
learn about a new dataset. First, I want to know
the center of my dataset, then I want to know how spread out
the other values are from the center. Measuring the center and the spread of
the dataset helps me quickly understand its overall structure, and decide which
parts I want to explore in more detail. Earlier, you learn that summary statistics
include measures of central tendency and measures of dispersion. Measures of central tendency are values
that represent the center of a dataset. Measures of dispersion are values that
represent the spread of a dataset. We’ll talk more about measures
of dispersion later on. In this video, you’ll learn how to calculate three
measures of central tendency. The mean, the median and the mode. You may remember these terms
from earlier in the program, but we’ll discuss their importance in
the study of statistics and data analysis. We’ll also discuss which measure is
best to use based on your specific data. Let’s start with the mean. The mean is the average
value in the dataset. To calculate the mean, you add up all the values in your dataset
and divide by the total number of values. For example, say you have the following
set of values, 10, 8, 5, 70. To find the mean, you add up all
the values for a total of 100, then you divide by 5,
the total number of values. The mean or average value is 20. Next, the median. The median is the middle
value in a dataset. This means half the values in the dataset
are larger than the median and half are smaller. You can find the median by arranging all
values in the dataset from smallest to largest. If you arrange your five values in
this way, you get 5, 7, 8, 10, 70. The median or middle value is eight. If they’re an even number
of values in your dataset, the median is the average
of the two middle values. Let’s say you add another value, 4,
to your set. Now, the two middle values are seven and
eight. The median is their average, 7.5. You may notice that the mean,
20 is much greater than the median, 8. This is because there’s one extreme value,
70 that increases the overall average. This value is known as an outlier. Recall that an outlier is a value
that differs greatly from the rest of the data as measures of central tendency. The mean and the median work better for
different kinds of data. If there are outliers in your dataset, the median is usually a better
measure of the center. If there are no out layers,
the mean usually works well. For example, imagine you want to buy
a home in a specific neighborhood. You tour ten homes in the area to
get an idea of the average price. The first nine homes
have a price of $100,000. The tenth home has the price
of a million dollars. This is an outlier that
pulls up the average. If you add all the home prices and
divide by 10, you find that the mean or average price is $190,000. The mean doesn’t give you a good measure
of the typical value of a home in this neighborhood. In fact, only one home out of
ten costs more than $100,000. The median home price is $100,000. The median gives you a much better idea
of the typical value of a home in this neighborhood, whether you use the mean or
median depends on the specific dataset you’re working with and what insights
you want to gain from your data. Finally, we have the mode. The mode is the most frequently
occurring value in the dataset. A dataset can have no mode,
one mode or more than one mode. For example, the set of numbers: one,
two, three, four, five, has no mode
because there are no value repeats. In the set one, three, three,
five, seven, the mode is three because three is the only value
that occurs more than once. The set one, two, two, four,
four has two modes, two and four. The mode is useful when working with
categorical data because it clearly shows you which category occurs most frequently. Say an online retail
company conducts a survey. Customers rate their experience as bad,
mediocre, good or great. A bar chart summarizes the results. The highest bar refers to the rating bad. This is the most frequently
occurring value or mode. The mode gives the company clear
feedback on customer satisfaction. To recap, the mean, median and mode all measure the center of
a dataset in different ways. The mean finds the average value. The median finds the middle value and the mod finds the most
frequently occurring value. Knowing the center of your dataset helps
you quickly understand its basic structure and determine the next
steps in your analysis. Just like knowing the city center
helps orient you in a new environment.

Reading: Measures of central tendency: The mean, the median, and the mode

Reading

Measures-of-central-tendency_-The-mean-the-median-and-the-mode Download

Video: Measures of dispersion

Notes

Quiz

Transcript

Understanding Data Spread: Range and Standard Deviation

This video explores how to measure the spread of data, focusing on the range and standard deviation.

Key Points:

Spread: How data points are distributed around the center (mean).
Range: Difference between the largest and smallest values.
Standard Deviation (SD): Measures how spread out values are from the mean.
- Larger SD indicates more spread, smaller SD indicates less spread.
- Formula: square root of the average squared distances from the mean.
Applications: Understanding weather patterns, predicting ad revenue, analyzing stock prices, etc.

Examples:

Daily temperatures: City A (mean 66°F, SD 3°F) vs. City B (mean 64°F, SD 16°F).
- Similar mean temperatures, but City B has more daily variation.
Data sets with the same mean but different spreads: Illustrates the importance of considering both center and spread.

Additional Notes:

The video mentions variance (square of SD) but will be covered later.
Statistical software is typically used for calculations.
Understanding concepts behind calculations is crucial for applying them effectively.

What descriptive statistic measures the spread of the values from the mean of a dataset?

Standard deviation

Standard deviation measures the spread of the values from the mean of a dataset.

As I mentioned earlier, there are two things
I want to know when I began to explore a new dataset. First, the location
of the center or the measures of central
tendency like the mean. Second, I want to know
how spread out the values are from the center or the
measures of dispersion, like the standard deviation. To get a complete
picture of the data, it’s good to know both the
center and the spread. For example, datasets with
the same central value can have different levels
of variability or spread. Take three small data sets. Each set has three values
that add up to 90. Each set has the same
mean, 90/3 equals 30. But the variation of values around the mean is
much different. In the first set, the values 25, 30 and 35 are all close
to the mean of 30. The third set, the values 5, 10, and 75 are much more
spread out from the mean. Earlier, you learned how
measures of central tendency, like mean, median, and mode represent the
center of your dataset. Now, you’ll learn how
measures of dispersion, such as the range and
standard deviation can help you understand
the spread of your data. The range is the
difference between the largest and smallest
value in a dataset. You have data on the
daily temperature in Fahrenheit for the city
of Central Valley, Costa Rica for the past week. The highest temperature
is 77 degrees, the lowest temperature
is 67 degrees. So the range is 10. The range is a useful metric because it’s easy
to calculate and gives you a very
quick understanding of the overall spread
of your dataset. However, standard
deviation gives you a more nuanced idea of the
variation in your data. Standard deviation
measures how spread out your values are from the
mean of your dataset. It calculates the
typical distance of a data point from the mean. The larger the
standard deviation, the more spread out your
values are from the mean. Another measure of spread
is called the variance, which is the average of the squared difference of each
data point from the mean. Basically, it’s the square
of the standard deviation. You’ll learn more
about variance and how to use it in a later course. Let’s check out the plots of three normal probability
distributions to get a better idea of spread. Later on, you’ll
learn more about distributions which map all
the values in a dataset. For now, just know
that the mean is the highest point on each
curve, right in the center. Each curve has a different
standard deviation. Blue is one, green is
two, and red is three. The blue curve has the
least spreads since most of its data points
fall close to the mean. Therefore, the blue curve has the small standard deviation. The red curve has the
largest standard deviation. It has the most
spread since most of its data points fall
farther away from the mean. Now, let’s talk about how you determine with these numbers. Here’s the formula for the standard deviation
of a sample. In other words,
the square root of Sigma times open
parenthesis x minus x bar, close parenthesis squared
divided by n minus 1. If you’re new to
stats this formula, it may seem like a secret code
or an unfamiliar language. That’s okay. We’ll go over the variables and
formula step-by-step. Plus you don’t need to memorize the formula or do all
the math on your own. As a data professional, you’ll typically use a
computer for calculations. Being able to
perform calculations is important for
your future career, but being familiar
with the concepts behind the calculations will help you apply
statistical methods to workplace problems. There are different
formulas to calculate the standard deviation for
a population in a sample. As a reminder, data professionals
typically work with sample data and then make inferences about populations
based on the sample. So, we’ll review the
formula for a sample. Let’s consider how to
interpret the formula by calculating the standard
deviation of a small dataset, 8, 10, and 12. Calculating the standard
deviation involves five steps. First, find the mean
of the dataset. The mean equals 10. Next, for each value
x and our dataset, we find the distance to the
mean, and then we square it. We’ll include that
calculation in our next step. The Greek letter sigma is
a symbol that means sum. We need to do the
x minus 10 squared calculation for each data point and add up all the results. That’s 8 minus 10 squared is 4, 10 minus 10 squared is 0, and 12 minus 10
squared equals 4. Add all those
together to equal 8, then divide that
total by n minus 1. N refers to the total number of values in your
dataset, which is 3. Three minus 1 equals 2. Then our sum of 8
divided by 2 equals 4. Finally, take the
square root of 4, that’s 2, the
standard deviation. Now let’s explore
an example of how standard deviation is
useful in everyday life. Meteorologists use
standard deviation for weather forecasting
to understand how much variation exists in
daily temperatures in different places and to make more accurate predictions
about the weather. Imagine two
meteorologists working in two different cities, City A and City B. During the month of March, City A has a mean temperature of 66 degrees Fahrenheit and a standard deviation
of three degrees. City B has a mean
temperature of 64 degrees Fahrenheit and a standard
deviation of 16 degrees. Both cities have similar
mean temperatures. In other words, the overall average temperature
is about the same, but the standard deviation
is much higher in City B. This means that there’s
more daily variation in temperature in City B. The weather may
change dramatically from day to day there. In City A, the weather
is more consistent. If the meteorologist in City B predicted the
weather based on the mean, they could be off by 16 degrees, which would lead to a lot
of unhappy residents. The standard deviation gives the meteorologist a
useful measure of variation to consider and a level of confidence
about their prediction. A low standard deviation in temperature makes
it a lot easier for the meteorologist in City A to accurately predict
the daily weather. Data professionals new
standard deviation to measure variation in many types
of data like ad revenues, stock prices, employee
salaries, and more. Now you have a better idea of how standard deviation measures
the spread of your data. Coming up, we’ll discuss
some ways of understanding the relative position of
the values in a dataset.

Reading: Measures of dispersion: Range, variance, and standard deviation

Reading

Recently, you learned that measures of dispersion let you describe the spread of your dataset, or the amount of variation in your data values. Measures of dispersion like standard deviation can give you an initial understanding of the distribution of your data, and help you determine what statistical methods to apply to your data.

In this reading, you’ll learn more about three measures of dispersion: the range, variance, and standard deviation. This reading focuses on the foundational concept of standard deviation. As a data professional, you’ll frequently calculate the standard deviation of your data, and use standard deviation as part of more complex data analysis.

Measures of dispersion

Let’s examine out the definition of each measure of dispersion: the range, variance, and standard deviation.

Range

The range is the difference between the largest and smallest value in a dataset.

For example, imagine you’re a biology teacher and you have data on scores for the final exam. The highest score is 99/100, or 99%. The lowest score is 62/100, or 62%. To calculate the range, subtract the lowest score from the highest score.

99 – 62 = 37

The range is 37 percentage points.

The range is a useful metric because it’s easy to calculate, and it gives you a very quick understanding of the overall spread of your dataset.

Variance

Another measure of spread is called the variance, which is the average of the squared difference of each data point from the mean. Basically, it’s the square of the standard deviation. You’ll learn more about variance and how to use it in a later course.

Standard deviation

To get a better understanding of the concept of standard deviation, let’s explore its definition, visualization, and statistical formula.

Definition

Standard deviation measures how spread out your values are from the mean of your dataset. It calculates the typical distance of a data point from the mean. The larger the standard deviation, the more spread out your values are from the mean. The smaller the standard deviation, the less spread out your values are from the mean.

Visualization

Let’s examine the plots of three normal probability distributions to get a better idea of spread. Later on, you’ll learn about distributions, which map all the values in a dataset. For now, just know that the mean is the highest point on each curve, right in the center.

Each curve has the same mean and a different standard deviation. The standard deviation of the blue dotted curve is 1, the green solid curve is 2, and the red dashed curve is 3. The blue dotted curve has the least spread since most of its data values fall close to the mean. Therefore, the blue dotted curve has the smallest standard deviation. The red dashed curve has the most spread since most of its data values fall farther away from the mean. Therefore, the red dashed curve has the largest standard deviation.

Formula

Now let’s discuss how you calculate standard deviation using a formula.

There are different formulas to calculate the standard deviation for a population and a sample. As a reminder, data professionals typically work with sample data, and they make inferences about populations based on the sample. So, let’s review the formula for sample standard deviation:

�=Σ(�−�‾)2�−1s=n−1Σ(x−x)2

In the formula, n is the total number of data values in your sample, x is each individual data value, and x̄ (pronounced “x-bar”) is the mean of your data values. The Greek letter Sigma is a symbol that means sum.

Note: As a data professional, you’ll typically use a computer for calculations. Being able to perform calculations is important for your future career, but being familiar with the concepts behind the calculations will help you apply statistical methods to workplace problems.

To better understand the different parts of the formula, let’s calculate the sample standard deviation of a small dataset: 2, 3, 10.

You can do this in five steps:

1. Calculate the mean, or average, of your data values.

(2 + 3 +10) ÷ 3 = 15 ÷ 3 = 5

2. Subtract the mean from each value.

2 – 5 = -3

3 – 5 = -2

10 – 5 = 5

3. Square each result.

-3 * -3 = 9

-2 * -2 = 4

5 * 5 = 25

4. Add up the squared results and divide this sum by one less than the number of data values. This is the variance.

(9 + 4 + 25) ÷ (3 -1) = 38 ÷ 2 = 19

5. Finally, find the square root of the variance.

√19 = 4.36

The sample standard deviation is 4.36.

Now that you know more about the concept of standard deviation, let’s check out an example of its practical application.

Example: Real estate prices

Imagine you’re a data professional working for a real estate company. The real estate agents on your team like to inform their clients about the variation in rental prices in different residential areas. Part of your job is calculating the standard deviation of monthly rental prices for apartments in specific neighborhoods, and sharing this information with your team. Let’s say you have sample data on monthly rental prices for one-bedroom apartments in two different neighborhoods: Emerald Woods and Rock Park. Assume you calculate the mean and standard deviation for each dataset.

Emerald Woods

Apartment	#1	#2	#3	#4	#5
Monthly Rent	$900	$950	$1,000	$1,050	$1,100

Mean: $1,000

Standard deviation: $79.05

Rock Park

Apartment	#1	#2	#3	#4	#5
Monthly Rent	$500	$650	$1,000	$1,350	$1,500

Mean: $1,000

Standard deviation: $431.56

Both neighborhoods have the same mean rental price of $1,000 per month. However, the standard deviation for rental prices in Rock Park ($431.56) is much higher than the standard deviation for rental prices in Emerald Woods ($79.05). This means that there is a lot more variation in rental prices in Rock Park. This is useful information for your agents. For example, they can tell clients that it may be easier for them to find a more affordable apartment in Rock Park that is far below the mean of $1,000. Standard deviation helps you quickly understand the variation in prices in any given neighborhood.

Key takeaways

Data professionals use standard deviation to measure variation in many types of data like ad revenues, stock prices, employee salaries, and more. Measures of dispersion like the standard deviation, variance, and range let you quickly identify the variation in your data values, and get a better understanding of the basic structure of your data.

Resources for more information

To learn more about measures of dispersion like the range, variance, and standard deviation, explore the following resources:

This article from Statistics Canada provides a helpful summary of variance and standard deviation, and discusses the usefulness of standard deviation as a measure of dispersion.

Video: Measures of position

Notes

Tutorial

Quiz

Transcript

Summary: Understanding Data Through Measures of Position

This video explores measures of position, essential tools for describing datasets. These measures help you understand the relative location of individual values compared to others in the data.

Key Concepts:

Percentiles: Represent the value below which a certain percentage of data falls. For example, a score in the 95th percentile is higher than 95% of other scores.
Quartiles: Divide the data into four equal parts. Each quartile represents the median value within its quarter:
- Q1 (Lower Quartile): 25% of data points are below, 75% above.
- Q2 (Median): 50% of data points are below, 50% above.
- Q3 (Upper Quartile): 75% of data points are below, 25% above.
Interquartile Range (IQR): Measures the spread of the middle 50% of data (between Q1 and Q3).
Five-Number Summary: Summarizes data distribution using these five values: minimum, Q1, median, Q3, maximum.

Benefits:

Comparing values across different scales: Percentiles enable comparison even when data sets use different scoring systems.
Understanding data distribution: Quartiles and IQR reveal how data points are spread out.
Creating a concise overview: The five-number summary provides a quick picture of data distribution.

Applications:

These measures are used in various fields to analyze and interpret data, including:

Public health (life expectancy)
Economics (household income)
Business (product sales)

Remember: Measures of position, along with central tendency and dispersion, equip you with a comprehensive understanding of your data.

Tutorial: Exploring Data with Measures of Position

Data analysis isn’t just about finding the average or spread of data points. Understanding measures of position helps you uncover how individual values “stand” compared to others in a dataset. This tutorial empowers you to interpret data more effectively.

Learning Objectives:

Define key measures of position: percentiles and quartiles.
Calculate these measures using real-world examples.
Understand the significance of the interquartile range (IQR) and five-number summary.

Prerequisites:

Basic understanding of data sets and numerical values

Let’s Dive In!

1. Finding Your Place: Percentiles

Imagine taking a standardized test like the SAT. Your score isn’t just a number; it tells you your position relative to other test-takers. Percentiles help you understand that position within a dataset.

Definition: A percentile represents the value below which a certain percentage of data falls.
Example: Your score falls in the 90th percentile. This means your score is higher than 90% of all test-takers.

Calculating Percentiles (using software is recommended for large datasets):

Order data points from smallest to largest.
Use formulas or software tools specific to the desired percentile.

Benefits:

Compare across scales: Percentiles allow you to compare values from different datasets with different scoring systems.
Identify exceptional performers: High percentiles (e.g., 99th percentile) indicate exceptional performance compared to the majority.

2. Dividing the Data: Quartiles

Quartiles split a dataset into four equal parts, giving you a more granular picture of the data’s distribution.

Definition: Each quartile represents the median value within its specific quarter:
- Q1 (Lower Quartile): 25% of data points are below, 75% above.
- Q2 (Median): 50% of data points are below, 50% above.
- Q3 (Upper Quartile): 75% of data points are below, 25% above.

Calculating Quartiles:

Order data points from smallest to largest.
Find the middle value (Q2, the median).
The data point halfway between the beginning and Q2 is Q1.
The data point halfway between the end and Q2 is Q3.

Benefits:

Understand data spread: Quartiles reveal how data points are distributed within a dataset.
Compare data subsets: You can compare the distribution of data points within specific groups or categories using quartiles.

3. Understanding the “Middle Ground”: Interquartile Range (IQR)

Imagine the “middle 50%” of your data set. The interquartile range (IQR) measures the spread of this data.

Definition: IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1).

Calculating IQR:

IQR = Q3 – Q1

Interpretation:

A larger IQR indicates a wider spread of data points within the middle 50%.
A smaller IQR signifies a more concentrated distribution in the middle of the data.

4. Five-Number Summary: A Snapshot of Distribution

The five-number summary provides a concise overview of a dataset’s distribution using five key values:

Minimum: The lowest data point.
Q1 (Lower Quartile): As defined earlier.
Q2 (Median): The middle data point.
Q3 (Upper Quartile): As defined earlier.
Maximum: The highest data point.

Benefits:

Offers a quick and easy way to visualize the data distribution.
Provides valuable information for creating data visualizations like boxplots.

5. Putting it all Together: Real-World Applications

Measures of position are used in various fields to analyze data:

Business: Assessing sales performance of different product lines by comparing their percentiles.
Finance: Analyzing portfolio performance by calculating the percentiles of individual asset returns within the portfolio.
Education: Evaluating students’ performance on standardized tests using percentiles and identifying at-risk students needing additional support.

By understanding and applying measures of position, you’ll gain a deeper understanding of your data and be able to make more informed decisions based on its distribution and relative positioning of individual values.

What measure of position divides the values in a dataset into four equal parts?

Quartile

A quartile divides the values in a dataset into four equal parts.

By now, you’ve learned
different ways to describe the center of your data with measures of central tendency, such as the mean and median. You can also use
measures of dispersion, such as the standard
deviation to represent the spread
of your data. These tools will help
you explore and better understand any dataset
you may encounter. Now, we’ll finish our tour of descriptive statistics
by checking out measures of position. Measures of position help you
determine the position of a value in relation to
other values in a dataset. Along with center and spread, it’s helpful to know the
relative position of the values. For example, whether one value is higher or lower than another, or whether a value
falls in the lower, middle, or upper portion
of your dataset. In a city, this is
similar to knowing where different places of interest are located in relation
to one another. For example, it’s useful to know how far the art museum is from the city park or if the
famous restaurant you want to eat at is close to the historical monument
you want to visit. In this video, you’ll learn about the most
common measures of position, percentiles
and quartiles. You’ll also learn
how to calculate the interquartile range and use the five number summary
to summarize your data. A percentile is the value below which a
percentage data falls. Percentiles show the
relative position or rank of a particular
value in a dataset. Some universities ask applicants to take standardized tests. For example, in
the United States, the SAT and the ACT
are common exams. When a student receives
their test score, they usually also receive a
corresponding percentile. For example, let’s
say a test score falls in the 99th percentile. This means the score
is higher than 99% of all test scores. If a score falls in
the 75th percentile, the score is higher than
75% of all test scores. If the score falls in
the 50th percentile. The score is higher than half or 50% of all test
scores and so on. Percentiles are useful
for comparing values. For example, different exams may have different
scoring systems. SAT scores range 400-1600, ACT scores range 1-36, and a typical school exam in math or history may range 0-100. If you only know the raw
scores for each test, say a 1,000 for the SAT, 20 for the ACT, and 70 for the school exam, you have no way of making
a meaningful comparison. If you know that all
three test scores fall in the 50th percentile, then you can
meaningfully compare student performance across
the different exams. You can use quartiles to get a general understanding of the relative position of values. A quartile divides the values in a dataset into four equal parts. Quartiles let you compare values relative to the four
quarters of data. Each quarter includes 25% of
the values in your dataset. The first quartile is the middle value in the
first half of the dataset. The first quartile, Q1 is also called the
lower quartile, 25% of the data points are
below Q1 and 75% are above it. The second quartile is the
median value in the set. Q2 is the median, 50% of the data points are
below Q2 and 50% are above it. The third quartile is the middle value in the
second half of the dataset, 75% of the data points are
below Q3 and 25% are above it. Note the relationship between
quartiles and percentiles. Q1 refers to the
25th percentile, Q2 to the 50th percentile, and Q3 to the 75th percentile. For example, say you’re the
manager of a sports team. You have data that
shows how many goals each player on your team scored over the course of
an entire season. You want to compare
the performance of each player based on scoring. You can calculate quartiles for your data using these steps. First, arrange the values
from smallest to largest, 11, 12, 14, 18, 22, 23, 27, 33. Second, find the median
of your data set. This is the second quartile, Q2. There are an even number of
values in the dataset so the median is the average
of the two middle values, 18 and 22, Q2 equals 20. Third, find the median of the lower half of your dataset. This is the lower quartile, Q1, Q1 equals 13. Finally, find the median of the upper half of your dataset. This is the upper quartile, Q3, Q3 equals 25. Breaking the data into
quartiles gives you a clear idea of
player performance. You now know that the lower quartile of players
scored 13 goals or less, and the upper quartile
scored 25 goals or more. In other words, the lower 25% of players scored
13 goals or less, and the upper 25% scored
25 goals or more. The middle 50% of players
scored between 13 and 25 goals. The middle 50% of your data is called the interquartile
range or IQR. The interquartile range is the distance between
the first quartile, Q1, and the third quartile, Q3. Technically, IQR is a measure of dispersion
because it measures the spread or the middle
50% of your data. This is the same as
the distance between the 25th and 75th percentiles, or between Q1 and Q3. IQR is also useful for determining the relative
position of your data values. IQR equals Q3 minus Q1. In this case, Q3 equals
25 and Q1 equals 13, so IQR equals 12. Finally, you can summarize
the major divisions in your dataset with the
five-number summary. The five numbers
include: the minimum, the first quartile, the
median or second quartile, the third quartile,
and the maximum. For your sports data, the
five-number summary is 11, 13, 20, 25, 38. The five-number
summary is useful because it gives you
an overall idea of the distribution
of your data from the extreme values
to the center. You can visualize
it with a box plot. The box part of the
boxplot goes from the first quartile to
the third quartile. The vertical line in the middle
of the box is the median. The horizontal lines on each side of the box,
known as whiskers, go from the first quartile to the minimum value and from the third quartile
to the maximum value. The following boxplot
shows the data on goals. We can find the values on the boxplot and determine
the inter-quartile range. Q1, or the lower
quartile equals 13, Q3, or the upper
quartile equals 25. The interquartile range
is the length of the box, 25 -13 = 12. Data professionals use
measures of position such as percentiles and
quartiles to better understand all kinds of data. This may include
public health data such as life expectancy, economic data such
as household income, business data such as
product sales and more. That concludes our tour of
descriptive statistics. Coming up, you’ll use Python to compute descriptive stats
and summarize a dataset.

Reading: Measures of position: Percentiles and quartiles

Reading

Measures-of-position_-Percentiles-and-quartiles Download

Video: Alok: Statistics as the foundation of data-driven solutions

Notes

Transcript

Alok on the Importance of Statistics in Data Science:

Alok, a Data Science Developer Advocate at Google Cloud, emphasizes the value of statistics for data professionals.

Definition: Statistics is the combination of applying mathematics to analyze data.
Importance:
- Provides a foundation for the mathematical techniques used in data science.
- Offers a broad perspective on applying math to various data problems.
Real-world Example: Alok used statistical methods to analyze user behavior differences between two groups (A and B).
- Initial observation: Significant discrepancies in a specific metric were found.
- Statistical analysis revealed:
  - Adjusted for confounding factors like user mix.
  - Calculated confidence intervals around the observed difference.
  - Conclusion: The observed difference was not statistically significant and could be attributed to other factors.
- Outcome: Executives were relieved and made an informed decision based on the analysis.
Learning Statistics: Completing a data science program provides a strong foundation for data analysis and impactful career opportunities.
Advice for Learners:
- Maintain a clear end goal (e.g., new skill or career change) for motivation.
- Take a step-by-step approach and forgive yourself for setbacks.
- Stay focused on your ultimate learning objectives.

Overall, Alok highlights the significance of statistics in data science and encourages aspiring data professionals to persevere in their learning journey.

I’m Alok. I’m a Data Science Developer
Advocate at Google Cloud. What we do generally is talk to developers about how
to use Google Cloud. The way I think about
statistics is that it’s a combination of mathematics
and applying it to data. It’s important for
data professionals to learn about statistics
because I think it gives you a good
foundation for the math behind the techniques
you’re going to apply, and to give a breadth of
information for how you might apply this math
to different problems. In my first job at Google, where I was a data scientist
on the search ads teams, we used statistical methods to generate insights that informed
decisions all the time. In fact, it was essentially
the core of the job. One time where statistics
helped influence decision-makers
from my experience was a project where
we had two groups, let’s say on A and B. We were seeing some
big differences in the behavior of these two
groups in a particular metric. The executives were worried about why were they different, maybe it shouldn’t
be so different. Statistical methods
were crucial to this. We had to adjust for
things like mix effects, looking at different
slices of our data, adding confidence
intervals around the mean difference
we’re seeing, and what we found was
that the difference was not as large as we had seen, and it could be attributed
to these other things like mix among our users
or things like that. Then when we presented
this to the executives, they were relieved that the
differences were not as big, and there wasn’t something
necessarily to do to change the dynamic
between group A and B. It was within the
reasonable difference that they felt was okay. The idea is
essentially statistics will give you a set of tools. In this case, it
gave me a set of tools to decompose
this problem into various parts and start to explain why we were seeing the differences we were seeing. The value of completing a program like this one
is that it gives you the foundation to do a lot of great work in data science
and data analytics. Having done some
coursework in data, as well as having some
project experiences, sets you up well to be
able to analyze data and make an impact in wherever industry you end up working in. The best thing I can tell those who are in
the middle of this and maybe struggling through is keep your end goal in mind, whether it’s
learning a new skill or opening up a whole
new career path, that can really be a
differentiator for you. Try to keep it step-by-step. If you fall a little bit behind, forgive yourself, and just keep that end goal in mind. That’s
where you want to get.

Practice Quiz: Test your knowledge: Descriptive statistics

A data professional is analyzing sales data for an online store. The most frequently occurring value in the dataset is $150. What term is used to describe this value?

Mode

Mode is used to describe this value. The mode is the most frequently occurring value in a dataset.

What do measures of dispersion represent?

The spread of a dataset

Measures of dispersion represent the spread of a dataset, or how spread out the values are from the center.

Which of the following descriptive statistics are measures of position? Select all that apply.

Quartile, Percentile

Measures of position include quartile and percentile. A quartile divides the values in a dataset into four equal parts. A percentile is the value below which a percentage of data falls.

Calculate statistics with Python

Lab: Annotated follow-along guide: Compute descriptive statistics with Python

Reading

Instructions

All of the instructional videos with onscreen coding demonstrations have a corresponding follow-along guide that is available to you. The follow-along guide is an annotated Jupyter notebook organized to match the content from each module. It contains the same code shown in the videos for that module. This guide is provided for your reference; you do not need to add any text or run the code yourself. If you would like to run the code, you will need to run each cell sequentially for the code to function as intended.

In addition to content that is identical to what is covered in the videos, you’ll often find additional information throughout the guide to explain the purpose of each concept covered, why the code is written in a certain way, and tips for running the code.

The landing page for each follow-along notebook also provides information about data sources (when relevant) and tips on how to access and use these guides.

Data dictionary

This notebook uses a file called education_districtwise.csv. This dataset represents a list of school districts in an anonymous country. The data includes district and state names, total population, and the literacy rate.

The dataset contains:

680 rows – each row is a different school district

7 columns

Column name	Type	Description
DISTNAME	str	The names of an anonymous country’s school districts
STATNAME	str	The names of an anonymous country’s states where school districts are located
BLOCKS	int64	The number of blocks in the school district. Blocks are the smallest organizational structure in the education system of the anonymous country.
VILLAGES	int64	column shows how many villages are in each district
CLUSTERS	int64	The number of clusters in the school district. Clusters are the second smallest organizational structure in the education system of the anonymous country.
TOTPOPULAT	int64	column shows the population for each district
OVERALL_LI	int64	shows the literacy rate for each district

Note: The data in this project was derived from this data set on Kaggle. However, it has been modified for pedagogical purposes.

Remember, you can access and download the data for any Jupyter notebook activity from within the notebook itself by navigating to the Lab Files dropdown menu at the top of the page, clicking into the /home/jovyan/work folder, selecting the relevant data file, and clicking Download.

Set up a split screen

While watching the videos that follow this notebook, you may find it helpful to track the instructor’s progress by following along in your own Jupyter Notebook. It can be helpful to review the code notebook alongside the video, especially if you’re new to coding in Python.

To do so:

Open the video in one browser window.
Then, open the annotated follow-along guide in a separate window.
Arrange your screen so that the video and the follow-along guide are both visible.

Tips for working with the follow-along guide

Follow these suggestions to enhance your learning experience:

Reference the Jupyter notebooks reading before starting if you need more information on working with the Jupyter notebooks.
Go to the section of the follow-along guide for the current module’s content. The follow-along guide has different sections for each video included in the module’s content. The in-video message will direct you to the relevant section in the guide for the specific video you are viewing.
Follow along in the notebook as the instructor discusses the code.

PDF

Annotated-follow-along-guide_Compute-descriptive-statistics-with-Python Download

Video: Compute descriptive statistics with Python

Notes

Tutorial

Transcript

Python for Descriptive Statistics: Analyzing Literacy Rates

This course teaches you how to use Python to compute descriptive statistics and summarize data sets.

Example: Analyzing literacy rates across districts in a large nation.

Steps:

Import libraries: numpy (np), pandas (pd), and matplotlib.pyplot (plt).
Load data: Use a provided CSV file named education_districtwise.csv.
Data overview: Use head() to view the first few rows and understand column names.
Descriptive statistics:
- Overall literacy rate:
  - Use describe() to get summary statistics (mean, median, quartiles, etc.) for the “OVERALL_LI” column.
  - Calculate the range (difference between highest and lowest values) using max() and min().
- State information:
  - Use describe() on the “State name” column to find the number of states and the state with the most districts.
Interpretation:
- The mean literacy rate is 73%, providing a general understanding of the national average.
- The large range (61.5%) indicates significant variation between districts.
- Knowing the state with the most districts helps target resource allocation.

Key takeaway: Descriptive statistics offer a preliminary understanding of your data, crucial for further analysis.

Additional notes:

Python provides individual functions (mean(), median(), etc.) for specific statistics.
This example focuses on numeric data. The describe() function behaves differently for categorical data.

Python for Descriptive Statistics: Analyzing Literacy Rates

This tutorial guides you through using Python to compute descriptive statistics and analyze literacy rates across districts in a fictional large nation. We’ll utilize the pandas library for data manipulation and NumPy for numerical calculations.

Prerequisites:

Basic understanding of Python programming
Familiarity with data analysis concepts like mean, median, and standard deviation

Software:

Python 3 (download and install from https://www.python.org/downloads/)
Jupyter Notebook (recommended, install using pip install jupyter) or any other Python IDE

Steps:

Import Libraries: Open your Jupyter Notebook or Python environment and import the necessary libraries: Pythonimport pandas as pd import numpy as np
Load Data: Download the provided CSV file named education_districtwise.csv (replace with your actual file path if different). Use pd.read_csv to load the data into a pandas DataFrame: Pythondata = pd.read_csv("education_districtwise.csv")
Data Overview: Get a glimpse of the data using the head() function to display the first few rows: Pythondata.head() Examine the column names to understand the information they represent.
Descriptive Statistics:a) Overall Literacy Rate:
- Summary statistics: Utilize the describe() function on the “OVERALL_LI” column (assuming this column holds literacy rates) to obtain various descriptive statistics like mean, median, standard deviation, quartiles, minimum, and maximum values: Pythonliteracy_stats = data["OVERALL_LI"].describe() print(literacy_stats)
- Range: Calculate the range (difference between highest and lowest values) using NumPy’s max() and min() functions: Pythonrange_literacy = np.max(data["OVERALL_LI"]) - np.min(data["OVERALL_LI"]) print("Range of literacy rates:", range_literacy)
b) State Information:
- Number of states: Apply describe() to the “State name” column to find the number of unique states: Pythonstate_info = data["State name"].describe() num_states = state_info["unique"] print("Number of states:", num_states)
- State with most districts: Use the mode() function to identify the state with the most districts (assuming the mode represents the most frequent value): Pythonmost_districts_state = data["State name"].mode()[0] print("State with most districts:", most_districts_state)
Interpretation:
- Analyze the obtained statistics. The mean literacy rate provides a general understanding of the national average. The range indicates the spread of literacy rates across districts. Knowing the state with the most districts can be helpful for further analysis and resource allocation.

Additional Notes:

This example showcases basic descriptive statistics. Python offers various other functions for specific calculations like standard deviation (std()) and variance (var()).
Remember to replace "education_districtwise.csv" with the actual path to your data file.
This tutorial can be further extended to explore relationships between literacy rates and other factors in the data through visualization and statistical tests.

Conclusion:

By utilizing Python libraries like pandas and NumPy, you can efficiently analyze data and gain valuable insights through descriptive statistics. This provides a strong foundation for further data exploration and advanced analysis techniques.

Recently, you learned how descriptive
statistics help you explore and summarize key features of your data. We talked about three different
types of descriptive stats. Measures of central tendency like the
mean, refer to the center of your dataset. Measures of dispersion like
the standard deviation, described the spread of your data set. Measures of position like percentiles and
quartile, show the relative location
of your data values. Earlier in the program, you learned about the process of
exploratory data analysis or EDA. From discovering to presenting your data. Whenever a data professional works with
a new data set, the first step is to understand the context of the data
during the discovery stage. Often, this is done by discussing
the data with project stakeholders and reading documentation about the data set,
and the data collection process. After that, a data professional
moves on to data cleaning, and deals with issues like missing data and
incorrect values. Computing descriptive stats is a common
step to take after data cleaning. Now, you’ll learn how to use python
to compute descriptive statistics and summarize the data set. The great thing about using python for
stats is that it does all the difficult work for you and takes care of
all the complex calculations. With a single line of code and the push
of a button, you have the mean, median, standard deviation and more. Using python is like having a friendly
math genius working right alongside you, a genius who never gets tired or
distracted or needs a coffee break. Also, you’re well prepared
to use python for stats. You’ve already learned how to use
python to organize, clean and visualize your data. In this course, we’ll introduce some
new functions specific to stats. But we’ll continue to use the same syntax
and coding concepts you’re familiar with. Now, let’s explore our example. Imagine you’re a data professional working
for the government of a large nation. The government’s Department of Education
is seeking to understand the current literacy rates across the country. The literacy rate is defined
by the percentage of the population of a given age
group that can read and write. You’re asked to analyze data about the
literacy rate among primary and secondary school students, these are students
who range in age from 6 -18 years old. There’s data available for
every State and district in the country. You’ll use descriptive stats to get
a basic understanding of the literacy rate data for each district. So, let’s open a jupyter notebook and
get started. To start import the python packages
you will use, numpy and pandas. And the library,
you’ll use matplotlib.pyplot. You worked with numpy, pandas and matplotlib.pyplot when
you learned about EDA. They’ll also help you
compute descriptive stats. To save time, rename each package and library with an abbreviation,
np, pd and plt. We’ve provided a file to download so
you can follow along. For this example, we’ll name our data,
education_districtwise. This tells us that we’re dealing with
education data organized by district. As a best practice, choose names that
clearly State the main content or purpose of the data. This way, you can easily access and
remember your data in the future. Start with the head function to get
a quick overview of the first ten rows of your data set. Recall that head will return as many rows
of data as you input into the variable field. Before you start computing
descriptive stats, review the contents of your dataset to
understand what the column headers mean. Your data set has seven columns 680 rows. The first five columns refer to
different administrative units, district name, State name,
blocks, villages and clusters. In other words,
this is the system the nation uses for organizing its population into
different sized units or sections. The nation is divided into States,
States are further divided into districts. A large State may have
more than 40 districts, a small State may have
fewer than four districts. Each district is further divided
into blocks, clusters and villages. The total population column, abbreviated TOTPOPULAT,
refers to population. The overall literacy rate column, abbreviated OVERALL_LI,
refers to literacy rate. To interpret this data correctly, it’s
important to understand that each row or observation, refers to
a different district. And not for
example to a State or a village. So, the village column shows how
many villages are in each district. The total population column shows
the population for each district. The overall literacy column shows
the literacy rate for each district. Now that you have a better
understanding of your dataset, use python to compute district of stats. When competing descriptive
stats in python, the most useful function
to know is describe. Data professionals use the describe
function as a convenient way to calculate many key stats all at once. If you use this function for a column
with numeric data, you get account of all the observations in the column,
along with the following stats. The mean or average value,
the median or middle value, the standard deviation or the value
that measures the spread of the data, the minimum and maximum values and
the 1st and 3rd quartiles. Your main interest is the literacy rate. This data is contained in
the overall literacy column, which shows the literacy rate for
each district in the nation. Used the describe function to show
key stats about literacy rate. The output lists key stats for
all districts, and the count category confirms there
are 634 districts in your data set. Note that the number of observations for
the overall literacy column is 634, but the number of rows
in the dataset is 680. This is because the describe function
does not include missing values. The summary of stats, gives you valuable information
about the overall literacy rate. For example, the mean helps to
clarify the center of your data set, you now know the average literacy
rate is about 73% for all districts. This information is useful in itself,
and also as a basis for comparison. Knowing the mean literacy rate for
all districts, helps you understand which individual districts are significantly
above or below the mean. This will help the department decide how
to devote resources to improving literacy. Note that the category is 25%, 50% and 75% refer to Q1, Q2 and Q3 respectively. Remember that Q2 is also
the median of your data set. You can also use the describe function for
column with categorical data, like the State name column. In this case,
you get a count of all the observations in the column along with
the following information. The number of unique values,
the most common value or mode and the frequency of the most common value. Using the describe function to find out
how many States are in your data set and which State has the most districts. The unique category, shows you that
there are 36 States in your data set. The top category shows that you that
State 21 is the most common value, and contains the most districts. The frequency category, tells you
that State 21 appears in 75 rows, which means it includes 75 different
districts, this is the mode. This information may be helpful
in determining which States will need more educational resources
based on their number of districts. The describe function is so useful because it shows you
a variety of key stats all at once. Python also has separate functions for
stats, such as mean, median, standard deviation, minimum and maximum. And you use the mean and median functions
earlier in the program to detect outliers. These individual functions are also useful
if you want to do further computations based on descriptive stats. For example, you can use the min and max functions together to
compute the range of your data. The range will show you
the difference between the highest and lowest literacy rates among all districts. To compute the range, use the max and min
functions to subtract the highest literacy rate from the lowest literacy rate. First name a new variable,
range_overall_LI. Then, use the max function on the overall
literacy column in put a minus sign and use the min function on the same column. Finally, display the value
of your variable. The range in literacy rates for
all districts is about 61.5% points, this is the maximum value of 98.7%
minus the minimum value of 37.2%. The large difference tells you that some
districts have much higher literacy rates than others. In an upcoming video,
You’ll continue to analyze this data and you can discover which districts
have the lowest literacy rates. This will help the government better
understand literacy rates nationally and build on their successful
educational programs. Using descriptive stats to summarize your
data set, is an important early step in the analysis process, giving you
a basic understanding of your data.

Lab: Activity: Explore descriptive statistics

Reading

Instructions

In this lab, you will compute descriptive statistics with Python. You will be presented with a business scenario and a dataset to use as you explore the scenario. You will practice leveraging functions in Python and computing the mean, median, minimum and maximum values to uncover and analyze information from within the set.

To complete this lab, you will open a Jupyter notebook and follow instructions to enter code and written responses where prompted.

Data Dictionary

This activity uses air quality index (AQI) data from the United States Environmental Protection Agency (EPA). The data was taken from the EPA’s own repository of AQI data and assembled for pedagogical purposes.

This activity uses a dataset called c4_epa_air_quality.csv. The variables in the dataset have been adjusted to suit the objectives of this lab, and may vary from other versions of the dataset you have worked with previously. It contains air quality readings for various counties in the United States. For more information about AQI data, refer to AirNow.

The dataset contains:

260 rows – each row is a different aqi reading

10 columns

Column name	Type	Description
[None]	int	index
date_local	datetime	Date of AQI reading
state_name	str	The name of the state
county_name	str	The name of the U.S. county
city_name	str	U.S. City where monitoring occurred
local_site name	str	The identifier of the site in the owning agency’s (e.g., not US EPA) nomenclature
parameter_name	str	The description assigned in the air quality system to the parameter measured by the monitor. Parameters may be pollutants or non-pollutants (e.g., wind speed).
units_of_measure	str	The unit of measure for all data in that row. Every parameter has a standard unit of measure.
aritmetic_mean	int	The measure of central tendency obtained from the sum of the observed pollutant data values in the quarterly data set divided by the number of values that comprise the sum for the quarterly data set. For criteria pollutants, the sum of values only adds the values with the appropriate flagging and concurrence for the exceptional data type.
aqi	int	“Air Quality Index” – AQI is like a thermometer that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern.

You can download the csv file from within the notebook itself by navigating to the Lab Files dropdown menu at the top of the page, clicking into the /home/jovyan/work folder, selecting the file, and clicking Download.

PDF

Activity_Explore-descriptive-statistics Download

Lab: Exemplar: Explore descriptive statistics

PDF

Exemplar_Explore-descriptive-statistics Download

Practice Quiz: Test your knowledge: Calculate statistics with Python

What Python function can data professionals use to compute the mean, median, and standard deviation all at once?

describe()

Data professionals can use the describe() function to compute the mean, median, and standard deviation all at once.

What two Python functions can you use to compute the range of your dataset?

max() and min()

To compute the range of your dataset, you subtract the minimum value from the maximum value. The max() function returns the maximum value. The min() function returns the minimum value. Range = max() – min().

Review: Introduction to statistics

Video: Wrap-up

Notes

Transcript

Statistics Course: Section 1 Recap

This summary highlights the key takeaways from the first section of the statistics course:

Learning Objectives:

Understand how data professionals use statistics to analyze data and draw insights.
Distinguish between descriptive and inferential statistics.
Apply descriptive statistics to:
- Summarize data using measures of central tendency (mean, median, mode).
- Analyze variation in data using measures of dispersion (range, variance, standard deviation).
- Identify outliers and understand their significance using measures of position (percentiles, quartiles).
Utilize Python to explore and analyze datasets, calculating descriptive statistics efficiently.

Key Takeaways:

Statistics empowers data professionals to understand and interpret data, aiding business leaders in decision-making and problem-solving.
Learning descriptive statistics equips you to explore, summarize, and gain insights from any data encountered in your future career.

Next Steps:

Prepare for the graded assessment by reviewing the new terms list.
Utilize provided resources (videos, readings) to solidify key concepts covered in this section.

Congratulations:

Acknowledgement of your progress and encouragement to continue learning.

We’ve come to the end of the first
section of your statistics course, wow. You’ve learned a lot already. Along the way, we’ve explored how data
professionals use statistics to gain insights from their data. This helps business leaders make
decisions and solve complex problems. We begin with the two main types of
statistics, descriptive and inferential. Data professionals use descriptive stats
to explore and summarize their data. They use inferential stats
to draw conclusions and make predictions about their data. During your tour of descriptive stats, you learned about measures of central
tendency, dispersion, and position. Finally, you learn that Python is
a powerful tool for statistical analysis. You use Python to explore a dataset and quickly calculate descriptive
statistics to summarize your data. You can use these skills to better
understand any new data set you may encounter in your future career. Coming up, you have a graded assessment. To prepare, check out the reading that
lists all the new terms you’ve learned. And feel free to revisit videos,
readings, and other resources that cover key concepts. Congrats on your progress so far,
and I’ll meet you again soon.

Reading: Glossary terms from module 1

Terms and definitions from Course 4, Module 1

A/B testing: A way to compare two versions of something to find out which version performs better

Confidence interval: A range of values that describes the uncertainty surrounding an estimate

Descriptive statistics: A type of statistics that summarizes the main features of a dataset

Econometrics: A branch of economics that uses statistics to analyze economic problems

Inferential statistics: A type of statistics that uses sample data to draw conclusions about a larger population

Interquartile range: The distance between the first quartile (Q1) and the third quartile (Q3)

Literacy rate: The percentage of the population in a given age group that can read and write

Mean: The average value in a dataset

Measure of central tendency: A value that represents the center of a dataset

Measure of dispersion: A value that represents the spread of a dataset, or the amount of variation in data points

Measure of position: A method by which the position of a value in relation to other values in a dataset is determined

Median: The middle value in a dataset

Mode: The most frequently occurring value in a dataset

Parameter: A characteristic of a population

Percentile: The value below which a percentage of data falls

Population: Every possible element that a data professional is interested in measuring

Quartile: A value that divides a dataset into four equal parts

Range: The difference between the largest and smallest value in a dataset

Representative sample: A sample that accurately reflects the characteristics of a population

Sample : A subset of a population

Sampling: The process of selecting a subset of data from a population

Standard deviation: A statistic that calculates the typical distance of a data point from the mean of a dataset

Statistic: A characteristic of a sample

Statistical significance: The claim that the results of a test or experiment are not explainable by chance alone

Statistics: The study of the collection, analysis, and interpretation of data

Summary statistics: A method that summarizes data using a single number

Variance: The average of the squared difference of each data point from the mean

Quiz: Module 1 challenge

A data team at a car dealership wants to improve open rates for their company’s weekly email campaign. They send two versions of the weekly email. Half of the customers get a subject line about new car colors, and half get a subject line about new car interiors. What does this scenario describe?

A/B testing

Gemini

The scenario described is A/B testing. So the answer is 2.

Here’s why:

A/B testing involves comparing two different versions of something (in this case, email subject lines) to see which one performs better.
In this scenario, the data team aims to improve open rates, so they’re testing two different subject lines to determine which one leads to a higher percentage of customers opening the email.
Hypothesis testing can be involved in A/B testing, but it’s a broader concept that can also be applied in other situations.
Regression analysis is used to model the relationship between variables, not directly applicable to comparing two different email subject lines.
Time series analysis focuses on analyzing data points collected over time, not directly relevant to the immediate comparison of two subject lines.

Therefore, A/B testing best describes the scenario where the data team is comparing two email subject lines to see which one leads to a higher open rate.

ChatGPT

A/B testing.

This scenario describes A/B testing. A/B testing, also known as split testing, is a method used to compare two versions of something to determine which one performs better. In this case, the data team at the car dealership is testing two different subject lines (about new car colors and new car interiors) to see which one results in higher open rates for the weekly email campaign. A/B testing helps businesses make data-driven decisions by comparing the performance of different variations in a controlled experiment. Hypothesis testing, regression analysis, and time series analysis are different statistical and analytical methods used for various purposes but are not specifically related to comparing different versions in the same manner as A/B testing.

Which of the following statements correctly describe key elements of inferential statistics? Select all that apply.

A sample can be used to draw conclusions about an entire population.
Data professionals use inferential statistics to predict behaviors.
The dataset that a sample is drawn from is called the population.

Gemini

Out of the choices provided, the following statements correctly describe key elements of inferential statistics:

1. A sample can be used to draw conclusions about an entire population. This is a fundamental principle of inferential statistics. Since collecting data from the entire population can be impractical or impossible, we use a representative sample to make inferences about the larger group.
3. The dataset that a sample is drawn from is called the population. This terminology is crucial in inferential statistics. The population refers to the entire group of individuals or objects we are interested in studying, while the sample is a smaller subset drawn from the population used for analysis.

Incorrect statements:

2. Data professionals use inferential statistics to predict behaviors. While predicting behaviors can be a goal, inferential statistics can also be used for other purposes like estimating population parameters (averages, proportions) or testing hypotheses.
4. Sample size has minimal effect on the validity of test results. Sample size directly impacts the reliability of inferences drawn from the sample. A larger sample size generally leads to more accurate and generalizable conclusions about the population.

Therefore, statements 1 and 3 accurately represent key characteristics of inferential statistics.

ChatGPT

The correct statements about key elements of inferential statistics are:

A sample can be used to draw conclusions about an entire population.
This is a fundamental principle of inferential statistics. By analyzing a sample drawn from a population, conclusions and predictions can be made about the entire population.
Data professionals use inferential statistics to predict behaviors.
Inferential statistics can be used to make predictions or inferences about behaviors, trends, or outcomes based on the analysis of sample data.

The statements that are not correct:

The dataset that a sample is drawn from is called the population.
This is incorrect. The population is the entire group that the researcher is interested in, and the sample is a subset of that population.
Sample size has minimal effect on the validity of test results.
This is incorrect. Sample size can have a significant effect on the validity of test results. Larger sample sizes generally lead to more reliable and accurate statistical analyses.

A data professional working for a water conservancy researches household water usage in a large city. Their dataset contains about 800,000 rows of data capturing how much water each household uses in a month. The data professional creates visualizations to quickly understand the data and create a summary for stakeholders. What does this scenario describe?

Descriptive statistics

Gemini

The scenario described is descriptive statistics. So the answer is 2.

Here’s why:

Descriptive statistics:
- Focuses on summarizing and describing the characteristics of a dataset.
- Involves techniques like calculating central tendency (mean, median, mode), dispersion (variance, standard deviation), and creating visualizations (e.g., histograms, box plots) to understand the data distribution.
- Aims to summarize the data itself, not make inferences about a larger population or test hypotheses.
While the scenario mentions creating a summary, it doesn’t involve elements characteristic of other options:
- Statistical significance: Refers to the level of evidence against a null hypothesis in hypothesis testing, not directly relevant here.
- Confidence intervals: Provide a range of values likely to contain the population parameter (e.g., mean) based on the sample data, not the core focus here.
- Inferential statistics: Aims to draw conclusions about a larger population beyond the data at hand, not solely focused on summarizing the given dataset.

Therefore, the data professional’s use of visualizations and summary creation aligns with the primary objective of descriptive statistics: understanding and summarizing the characteristics of the existing water usage data.

Descriptive statistics.

This scenario describes the use of descriptive statistics. Descriptive statistics involve summarizing and presenting the main features of a dataset, such as the mean, median, mode, range, and other measures that provide an overview of the data’s characteristics. In this case, the data professional is creating visualizations and summaries to understand the household water usage in the large city, which falls under the category of descriptive statistics. Statistical significance, confidence intervals, and inferential statistics involve making inferences about populations based on samples and testing hypotheses, which is not explicitly described in this scenario.

Fill in the blank: A _____ is a characteristic of a population.

parameter

Gemini

The answer is parameter.

Here’s why:

A parameter is a numerical characteristic that describes a population. It represents a fixed value that summarizes some aspect of the entire population. Examples include population mean, median, or standard deviation.
Measure: This term is more general and can refer to any numerical characteristic calculated from data, whether from a sample or a population.
Range: This refers to the difference between the largest and smallest values in a dataset, not specific to a population.
Sample: This refers to a subset of the population used for data collection and analysis. It is not a characteristic of the population itself.

Therefore, a parameter is the specific term used to describe a numerical characteristic of an entire population.

ChatGPT

parameter

A parameter is a characteristic of a population. In statistics, parameters are values that describe various features of a population, such as the mean, standard deviation, proportion, etc. Parameters are typically represented by Greek letters (e.g., μ for the population mean).

A company conducts an employee satisfaction survey. Employees rate their work experience as unacceptable, average, good, or excellent. The most frequently occurring value in the survey is excellent. What descriptive statistics concept best describes this value?

Mode

Gemini

The most frequently occurring value in the survey data is the mode. So the answer is 3.

Here’s why:

Mode: This term refers to the value that appears most often in a dataset. In the employee satisfaction survey, “excellent” is the most frequently chosen response, making it the mode.
Mean: This represents the average of all the values in the dataset. It’s calculated by adding all the values and dividing by the total number of observations.
Median: This refers to the “middle” value in a dataset arranged in order from lowest to highest. Since the company uses a categorical rating scale, calculating the median wouldn’t be appropriate here.
Standard deviation: This measures the spread or variability of the data points around the mean. It’s not relevant to identifying the most common value.

Therefore, the mode best describes the most frequently occurring value (“excellent”) in the employee satisfaction survey data.

ChatGPT

Mode

The descriptive statistics concept that best describes the most frequently occurring value in a dataset is the Mode. In this scenario, where “excellent” is the most frequently occurring rating in the employee satisfaction survey, “excellent” is the mode of the dataset. The mode is the value that appears most frequently in a set of data.

A data professional works with the following dataset: 2, 2, 4, 7, 10. What is the mode of the dataset?

Gemini

The mode of the dataset is 2.

It appears twice in the data set, more than any other value.

# Calculate the mode (most frequent value) of the data set
data = [2, 2, 4, 7, 10]

# Create a dictionary to store the frequency of each value
value_counts = {}
for value in data:
  if value in value_counts:
    value_counts[value] += 1
  else:
    value_counts[value] = 1

# Find the value with the highest count
mode = max(value_counts, key=value_counts.get)

# Print the mode
print(mode)

ChatGPT

The mode of a dataset is the value that appears most frequently. In the given dataset (2, 2, 4, 7, 10), the number “2” appears more frequently than any other number, making it the mode of the dataset. Therefore, the correct answer is 2.

What concept best describes the standard deviation, variance, and range?

Measures of dispersion

Gemini

The concept that best describes standard deviation, variance, and range is measures of dispersion. So the answer is 4.

Here’s why:

Measures of dispersion: These statistics quantify the spread or variability of data points around a central value (like the mean or median). They help understand how close or far data points are clustered together.
Standard deviation, variance, and range:
- Standard deviation: Measures the average distance of data points from the mean.
- Variance: Represents the squared standard deviation, also reflecting the spread of data.
- Range: Captures the difference between the highest and lowest values in the data set.
Measures of frequency: These statistics describe how often specific values appear in a dataset (e.g., mode, frequency tables).
Measures of position: These statistics identify specific locations within an ordered data set (e.g., quartiles, percentiles).
Measures of central tendency: These statistics summarize the “center” of the data (e.g., mean, median, mode).

Therefore, standard deviation, variance, and range all contribute to understanding the dispersion or spread of data, making it the most appropriate concept to describe them.

ChatGPT

Measures of Dispersion

The concept that best describes the standard deviation, variance, and range is Measures of Dispersion. These statistics provide information about the spread or variability of a dataset. Measures of dispersion help quantify how much the individual data points differ from the central tendency (e.g., mean or median) and give insights into the distribution of the data points within the dataset. Standard deviation, variance, and range are all examples of measures of dispersion.

A data professional is analyzing tomato growth data. Their dataset includes the circumference of tomatoes in millimeters: 40, 49, 50, 52, 66.3, 77.5, 78, 80. What is the range of their dataset?

Gemini

The range of the dataset is 40.

The range is calculated by finding the difference between the highest and lowest values in the data set. In this case, the highest value is 80 and the lowest value is 40, so the range is 80−40=40.

Statistic	Value
Range	40

# Data
data = [40, 49, 50, 52, 66.3, 77.5, 78, 80]

# Calculate the range
data_range = max(data) - min(data)

# Print the range
print(data_range)

ChatGPT

The range of a dataset is calculated as the difference between the maximum and minimum values. In this dataset (40, 49, 50, 52, 66.3, 77.5, 78, 80), the minimum value is 40, and the maximum value is 80.

Range = Maximum value – Minimum value

Range = 80 – 40 = 40

Therefore, the correct answer is 40.

A data professional is analyzing data about annual work income in dollars. They divide the data into quartiles: Q1 = $40,000, Q2 = $55,000, Q3 = $70,000. What percentage of the values in their dataset are above $70,000?

25%

Gemini

The data professional has divided the income data into quartiles, meaning each quartile represents 25% of the data. Since Q3 is $70,000, it represents the 75th percentile of the data.

Therefore, 75% of the values in the dataset are below or equal to $70,000. This means the remaining 25% of the values are above $70,000.

So the answer is 25%.

ChatGPT

25%

In quartiles, Q3 represents the third quartile, which is the 75th percentile. This means that 75% of the values in the dataset are below $70,000. Therefore, the percentage of values above $70,000 is the complementary percentage to 75%, which is:

100% – 75% = 25%

So, the correct answer is 25%.

If you apply the describe() function to categorical data, the results will include which of the following descriptive statistics?

Mode

Gemini

When you apply the describe() function to categorical data in Python using pandas, it doesn’t calculate statistics like mean, median, or standard deviation. These statistics are typically used for numerical data.

Instead, the describe() function for categorical data in pandas provides information about the categories themselves, including:

count: The number of observations in the dataset.
unique: The number of unique categories present.
top: The category that appears most frequently (the mode).
freq: The frequency of the most frequent category (the number of times the mode appears).

Therefore, the answer is 2. Mode.

ChatGPT

Mode

When you apply the describe() function to categorical data, the results will include descriptive statistics such as the count, unique values, and the mode. The mode is particularly relevant for categorical data, as it represents the most frequently occurring category or value in the dataset. Categorical data does not have a mean, median, or standard deviation, as these measures are applicable to numerical (quantitative) data.

Home » Google Career Certificates » Google Advanced Data Analytics Professional Certificate » The Power of Statistics » Module 1: Introduction to statistics

Module 1: Introduction to statistics

Get started with the course

Video: Introduction to Course 4

Video: Evan: Engage and connect

VIdeo: Welcome to module 1

Video: The role of statistics in data science

Video: Statistics in action: A/B testing

Video: Descriptive statistics versus inferential statistics

Practice Quiz: Test your knowledge: The role of statistics in data science

Descriptive statistics

Video: Measures of central tendency

Reading: Measures of central tendency: The mean, the median, and the mode

Video: Measures of dispersion

Reading: Measures of dispersion: Range, variance, and standard deviation

Measures of dispersion

Formula

Example: Real estate prices

Key takeaways

Resources for more information

Video: Measures of position

Reading: Measures of position: Percentiles and quartiles

Video: Alok: Statistics as the foundation of data-driven solutions

Practice Quiz: Test your knowledge: Descriptive statistics

Calculate statistics with Python

Lab: Annotated follow-along guide: Compute descriptive statistics with Python

Set up a split screen

Tips for working with the follow-along guide

Video: Compute descriptive statistics with Python

Lab: Activity: Explore descriptive statistics

Lab: Exemplar: Explore descriptive statistics

Practice Quiz: Test your knowledge: Calculate statistics with Python

Review: Introduction to statistics

Video: Wrap-up

Reading: Glossary terms from module 1

Quiz: Module 1 challenge

Share this:

Like this: