
Week 1: Big Data and Artificial Intelligence

In this module, you will be introduced to Big Data and examine how machine learning is used throughout various business segments. You will learn how data is analyzed and extracted, and how digital technologies have been used to expand and transform businesses. You will also get a detailed look at data management tools, how they are best implemented, and the value of data warehouses. By the end of this module, you will have gained insight into how machine learning can be used as a general-purpose technology, and some best techniques and practices for data mining.

Learning Objectives

  • Examine Big Data and how it is best used in identifying issues within a business
  • Assess how different skillsets are needed to manage, understand, and act on Big Data
  • Review the three types of Machine Learning and identify their applications

Video: AI for Business Introduction

Course Summary: AI for Business

This course, titled “AI for Business,” is designed to provide a comprehensive understanding of artificial intelligence (AI) from a managerial perspective. The course is led by Kartik Hosanagar, a Professor of Technology and Digital Business at the Wharton School, who specializes in the digital economy, including topics such as Internet commerce, digital media, digital marketing, and data-driven decision-making. Hosanagar is also the author of “A Human’s Guide to Machine Intelligence,” which explores the implications of AI in business decision-making.

Key Course Objectives:

  1. Understanding AI and its Business Applications: The course begins by defining AI as the capability of computers to perform tasks that typically require human intelligence, such as understanding language, reasoning, and learning. It introduces machine learning as a subset of AI, focused on enabling computers to learn without explicit programming.
  2. AI as the Next Phase of Digital Transformation: The curriculum positions AI as the next frontier in digital transformation, following the Internet, cloud computing, and mobile computing. It highlights how companies that embraced these technologies early on were better positioned for long-term success.
  3. AI as a General Purpose Technology: The course discusses the concept of general purpose technologies, which have broad applications across various sectors and can drive economic growth. It presents research indicating that AI, particularly machine learning, exhibits characteristics of a general purpose technology, with a significant number of research jobs spread across multiple industries.
  4. Implications for Business Managers: The course emphasizes that AI’s widespread use and ongoing research imply that managers must be patient and adaptable. It suggests that AI will affect a variety of industries, and businesses must prepare for its transformative impact by understanding the technology, adapting business models, and fostering a culture of innovation.

Course Goals:

  • To equip managers with a strategic framework for leveraging AI to achieve a return on investment.
  • To explore various business use cases for AI and how it can be applied across different sectors.
  • To discuss the organizational changes required to integrate AI effectively, including technological infrastructure, business processes, and cultural shifts.

Conclusion:

“AI for Business” aims to prepare managers and business leaders for the AI-driven future, providing them with the insights and tools necessary to navigate the challenges and opportunities that AI presents. The course underscores the importance of understanding AI’s potential, being patient with its development, and being proactive in adapting business strategies to capitalize on the transformative power of AI.

Welcome. This is a course
on AI for Business. In this course, we will talk about AI from a
managerial standpoint. We’ll look at several
business use cases for artificial intelligence
and we will look at a strategic framework
for managers to use in order to get the
return from AI investments. I’m Kartik Hosanagar, I’m a Professor of Technology and Digital Business
at the Wharton School. My research focuses broadly
on the digital economy. I look at Internet
commerce, digital media, digital marketing, and
data-driven decision-making. I was previously a
co-founder of Yodle, which was a marketing
platform for small businesses, and I've worked with a number of startups and large companies
over the years, looking specifically
at the applications of AI and Data
Science in business. I'm also the author of the book A Human's Guide to
Machine Intelligence, which looks at the
implications of using artificial intelligence to make business decisions, both within the
enterprise and outside. Artificial intelligence
or AI is all around us. AI is about getting
computers to do the kinds of things that
require human intelligence. For example,
understanding language, reasoning, navigating
the physical world, learning, and so on. Machine learning is a subfield
of AI that is focused on getting computers to learn without explicitly
programming them. AI is increasingly being seen as the next phase of
digital transformation. Over the years, a number of different digital
technologies have helped enable business
transformation, which is the idea
of organizing or transforming a company’s
activities and processes in order to make use of the new opportunities created
by digital technologies. Now in the late ’90s, the technology that
helped usher in that transformation
was the Internet. A number of companies opened online divisions to help make
use of that opportunity. But unfortunately, a number
of these firms also shut down these online divisions
after the dot-com bust. The few companies that persisted actually benefited
significantly in the long run and the
companies that did not ended up paying
a significant price. In the mid 2000s, Cloud computing helped
usher in a similar change. Here again, a number
of companies started investing in Cloud
computing, but again, a few companies
backed off when they realized that their early
forays into Cloud computing faced a number of challenges
related to security with moving their data to the Cloud or with regulatory compliance. But these companies
that backed off, again, paid the price and
the companies that persisted were the
ones that were well-positioned in the long run and helped create
a certain amount of business agility that has
helped them in the long run. In the late 2000s, mobile computing helped
create a similar change, and the companies
that invested in mobile computing
early helped really create mobile-first and
mobile-only products and helped transform their
businesses for a mobile world. Today, it's
increasingly appearing like AI will be equally
transformative. In fact, there is early
evidence that AI can be seen as what is often referred to as a general
purpose technology. Now a general purpose technology is a technology that has the potential for widespread use
across a range of sectors, and these general
purpose technologies can stimulate innovation and
drive economic growth. At an organizational level, they can also inform
product strategy and overall design of
the organization itself. Now three factors
are seen as being
indicative of
whether a technology is a general purpose technology. The first is that
the technology has widespread use across
multiple industries. The second is that there are a large number of research jobs
related to that technology, and the third is that these research
jobs themselves are spread across a number of
other industries as well. In a recent study
by Goldfarb et al., the researchers looked at whether artificial
intelligence shows promise as a general purpose technology. They looked at a number of recent technologies that have
got attention in the press. For example, machine learning, geographical information
systems, CRISPR, quantum computing,
fracking, robotics, nanotechnology, Internet of
Things, and Cloud computing. They also looked at millions
of job postings and classified these
job postings based on which technology
they were related to. They evaluated whether
machine learning, which is a subfield of artificial intelligence,
looks different. As you can see on the slide, the researchers found
that a number of job postings were related
to machine learning, as were a few other technologies like robotics and
Cloud computing. In fact, as many
as 14.6 percent or almost 15 percent
of the jobs for machine learning, were
research-related jobs. Now research-related jobs are a particularly important
indicator of a general purpose technology
because they help demonstrate that
the technology is capable of ongoing improvement. This ongoing improvement creates significant future potential, some of which is not currently recognized and certainly
machine learning, which is the most
important sub-field of AI, seems to demonstrate
that capability. The researchers also looked at whether machine learning jobs in particular were spread across a number of different
industries. They did find that
machine learning jobs can be seen in a number
of different industries, primarily in education services, professional services, manufacturing,
financial services, and so on. In contrast, some of the
other technologies like quantum computing
were seen primarily in one or two industries
like professional services. In short, it does seem like machine learning jobs
are available in multiple industries and
multiple industries see value from
those skills today. Next, the researchers looked
at whether research jobs, in particular, were also
widespread across industries. Here again, the researchers found that a number
of industries, including manufacturing,
professional services, information
technology, finance, and education, all had a need for research jobs related
to machine learning. Not every other technology
showed this widespread distribution. In short, a number of these
statistics do suggest that machine learning in
particular, and AI in general, is likely to be a general
purpose technology. There are many
implications of this. The first is that companies and managers
need to realize that machine learning
and AI will have significant impact on a
wide variety of industries. Just because you're not
in a technology industry does not mean that
you’re shielded from the transformative impacts
of machine learning. Secondly, the fact that a lot of these jobs
are research jobs, which implies the
technology is evolving, also implies that managers need to be patient
with the technology. The transformative impact of the technology might
come with a lag. Therefore, to effectively make use of these opportunities, managers will need to understand the technology
and its applications, they will need to
make many changes to their business models, to their tech infrastructure, to their organizational
processes, and to their culture as well. All of that requires
significant changes. The purpose of this course
is to help you get there.

Video: Course Introduction

Course Overview

This course covers AI Fundamentals from a business perspective, focusing on Big Data, Artificial Intelligence, and Machine Learning.

Module 1: Introduction to Big Data

  • Defining Big Data and its characteristics
  • Working with Big Data
  • Business questions that Big Data can help answer

Module 2: Introduction to Artificial Intelligence

  • Defining Artificial Intelligence and Machine Learning
  • Relationship between AI and ML
  • Types of Machine Learning methods

Module 3: Machine Learning in Practice

  • Machine Learning visualizations
  • Recent developments in ML, such as AutoML
  • Simple interfaces for non-engineers and non-data scientists to leverage AI

Module 4: The Role of Data in Building AI Systems

  • Importance of large training data sets for AI systems
  • Challenges for small companies and enterprises without data
  • Strategies for building AI systems without data

The course aims to provide a comprehensive understanding of AI Fundamentals, from Big Data to Machine Learning, and how they can be applied in a business context.

This course will discuss AI Fundamentals
from a business perspective. We’ll begin with
an introduction to Big Data. Specifically, what exactly is Big Data? How does one work with it? And what types of questions, business
questions can Big Data help you answer? We'll then move to an introduction
to artificial intelligence. We’ll talk about what is
artificial intelligence? What is machine learning? How are they related and what are the different types
of machine learning methods? Next, my colleague Professor Sonny Tambe will talk about machine
learning in practice. He’ll discuss machine learning
visualizations as well as recent developments such as AutoML that allow
non-engineers and non-data scientists to leverage AI to answer business
questions through very simple interfaces. And finally, I will talk about the role
of data in building AI systems. Specifically, modern AI is built
on large training data sets, which implies that for companies
to have flourishing AI practices, they really need to have
access to a lot of data. But how do small companies start
with an AI practice without data? Or how do companies in general roll out
AI in their enterprise without data? We'll talk about building AI systems
without data in the final module of this course.

Video: Big Data Overview

What is Big Data?

Big data refers to large volumes of data that exceed the capacity of conventional methods and computer systems. It’s not just about volume, but also about variety (structured and unstructured data), velocity (data streaming in at high speed), and veracity (truthfulness of data).

The 3 V’s of Big Data

  1. Volume: Large amounts of data that can’t be stored or analyzed on personal computers.
  2. Variety: Structured and unstructured data, including text, audio, and video data.
  3. Velocity: Data streaming in at high speed, requiring real-time analysis and decision-making.

The 4th V: Veracity

Veracity refers to the truthfulness of data, which is critical in big data due to the variety of sources and potential inconsistencies.

Why is Big Data Important?

Big data is important because it allows managers to ask new questions and answer old questions better. It enables companies to make data-driven decisions, improve operations, and create new business models.

Applications of Big Data

Big data has applications in various industries, including:

  1. Marketing: Analyzing social media data to craft targeted marketing campaigns.
  2. Finance: Detecting credit card fraud in real-time using big data tools.
  3. Healthcare: Analyzing wearable device data to improve consumer well-being.
  4. Transportation: Using sensor data to optimize traffic patterns and route planning.

Conclusion

Big data is a transformative force that is changing the way businesses operate. It enables companies to make better decisions, improve operations, and create new business models. In the next module, we’ll explore machine learning and its applications in various industries.

In this module, we’ll
talk about big data. In particular, we
will start with an overview of what
exactly is big data, we’ll talk a little bit
about what kinds of skills are needed to
excel with big data, we’ll talk about big data
tools and infrastructure, and we will conclude by
talking about data mining and setting up the stage
for machine learning, which we will cover in module 2. To begin with, let’s explore
what exactly is big data. Now, data is certainly a concept that’s been around for
a really long time and there has been
an emphasis on data for several decades now. We hear phrases like, “Data is the new oil. Data is just like crude. It’s valuable, but if unrefined it cannot
really be used.” Futurist John Naisbitt says that we have for the first time an economy which is based on a key resource that is information that is
not only renewable, but it’s also self-generating. Running out of this
resource is not the problem but drowning
in it is the real problem. Now, we’ve heard phrases
like this for awhile. Data has been very important to businesses for multiple decades, but the focus or emphasis on
big data is relatively new. Now, big data, as
the term suggests, is about large volume of data. In fact, the National Institute of Standards and Technology says that big data is data that exceeds the capacity
or capability of conventional methods
and computer systems. Now volume is certainly a
key aspect of big data. But it’s not just about volume. When we talk about big data, we’re talking about data
with different structure, we’re talking about
data that is being created at a different speed, we’re talking about
different kinds of tools to analyze the data
and most importantly, from a managerial standpoint, we’re talking about
different kinds of business questions that
we can answer with that. Now, one way to
think about big data is through the three
V’s of big data; volume, variety, and velocity. Volume of data simply
implies that we're now talking about terabytes
or petabytes of data. In short, the kinds of
data that won’t fit in our laptops and
personal computers. The kind of data
that we cannot open in Excel and just
start analyzing. That’s what the volume
of data is all about. In terms of variety, we refer to the fact that
we’re no longer talking about structured numerical
data that you can analyze in Excel spreadsheets. Rather, we're now talking about unstructured data,
meaning text data, audio data, video data, where there’s
intelligence hidden in that data that
we want to extract. In terms of velocity, we’re referring to the idea that data are
constantly coming in. It’s streaming every second
and milliseconds and we need to be able to
perhaps even analyze the data and make
decisions on the fly. That’s what data
velocity is all about. Sometimes when we
talk about big data, there’s a fourth V, which is veracity or truthfulness
of data that comes in. Data veracity refers to the point that data
are coming in from multiple sources and are
not curated as in the past, and so you might
have data coming in from social media platforms, meaning user-generated
content and this content might not
exactly be high-quality data, so we need to account for that. We might also have
inconsistency of data or incomplete data, and
so data veracity is also becoming a
fourth V, which is very critical and an
integral part of big data. Now of course, a natural
question to ask is why is this emphasis on
big data so new? Really, it comes
down to two things. The first is computing capacity. Computing capacity has been
growing exponentially. Our ability to store data and process data has been
growing exponentially and that has made big
data tools available today that simply weren’t
available 10 years back. The second is data generation itself is being transformed. In the past, data
was generated in a centralized way
and it was limited. In contrast, today data is being generated in a
decentralized way. There's a lot of user-generated
content that our customers, for example, are generating. There’s data generated
from mobile devices, again, from each
individual user. There's data being generated from
thousands of sensors that a company might be using in its manufacturing facility
or retail stores. All of these factors are
resulting in an explosion of data; it really is all about the
transformation in how data is generated. But most importantly,
big data also changes the things
a manager can do. In particular, big data allows managers to ask new
questions that they simply couldn't ask before, and it also helps them answer the same
old questions better. In terms of the ability
to ask new questions, consider the problem of a
marketing manager who's trying to design a marketing
campaign for a new product. The manager has to decide what product features
to emphasize. If it’s a phone, the manager has to decide whether we
should be talking about the battery life of
the phone or should we instead be talking about the
sleek design of the phone, or should we instead talk about the user interface and
how user-friendly it is, or should we talk about the
brand as such and be talking about our social and
philanthropic initiatives in our marketing campaigns? These are questions that
are hard to answer. In the past, they were
answered partly by gut, partly by small-scale
user surveys. But now a marketing manager can look at data on social media
platforms and they can look at data on Twitter and Facebook and other platforms and look at what aspects of our products our customers are really
appreciating and enjoying. What is it in the data on
social media platforms that suggests what
differentiates our brand from other brands? They can use these data to precisely craft
marketing messages. This might not have
been feasible in the past but it’s
feasible through big data that is available on social media platforms that
we can analyze at scale. I also mentioned that
big data allows us to answer the same
old questions better. For example, consider credit
card fraud detection. Credit card fraud is rampant in the financial
services industry and costs these companies
billions of dollars. In the past it was hard to
detect and most commonly, it was detected well after
the fact, for example, a customer might see their credit card statement and conclude that a
certain transaction is fraudulent and might call the customer service center
and flag that transaction, and then it gets corrected, but it’s done after the
fact and often it’s hard to really recover
the lost money. In contrast today
with big data tools, companies can
analyze transactions on the flight right after a customer swipes a credit
card on a terminal. Big data tools can analyze that transaction and determine whether it’s fraudulent or not. This helps not only
detect fraud faster, it also helps do it at scale which simply
was not feasible before and this
is creating a lot of value for financial
services companies. The value of big
data is not limited to just financial
services companies. We see applications in a number of industries
like healthcare, education, transportation,
and many more. For example, if you
look at healthcare, there’s a big trend in
wearable devices these days, a lot of consumers are wearing
devices like Fitbit and others and these devices are able to capture
data about heart rate, sleep patterns, exercise, and many more aspects of
our daily lifestyle. This kind of data
ultimately helps consumers take better actions to improve their well-being. Similarly, consider
transportation. There are sensors on
roads that can capture data on traffic
patterns, road closures, accidents and now that
data is being made available to us in real
time on our mobile devices. This helps us plan a route
better, helps in scheduling, and ultimately is the basis of applications like
Google Maps and many other mapping systems many of us use on a daily basis. These are but a few examples
of applications of big data. In fact, later in module 3, we will look at a number of other applications of big data in a variety of industries. We will also look at how machine learning
is being used in these industries to extract intelligence out of the
data in these settings.

Video: Big Data Analysis

Traditional Data Analysis vs. Big Data Analysis:

  • Traditional data analysis is hypothesis-driven, starting with a specific question and testing a hypothesis.
  • Big data analysis is more exploratory, starting with broad business questions and using data to find patterns, relationships, and correlations.
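
To make the contrast concrete, here is a minimal Python sketch of the two styles; the customer table and column names are invented for illustration. The hypothesis-driven path states one question up front and tests it, while the exploratory path scans the data for patterns that might suggest hypotheses.

```python
import pandas as pd
from scipy import stats

# Hypothetical customer data; in practice this would come from your warehouse.
df = pd.DataFrame({
    "spend": [120, 80, 200, 150, 90, 300, 60, 210],
    "visits": [4, 2, 8, 6, 3, 10, 2, 7],
    "on_promo": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Traditional, hypothesis-driven analysis: state a hypothesis up front
# ("customers on promotion spend more") and test it.
promo = df.loc[df.on_promo == 1, "spend"]
no_promo = df.loc[df.on_promo == 0, "spend"]
t_stat, p_value = stats.ttest_ind(promo, no_promo)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Exploratory big data analysis: no hypothesis yet; scan for patterns
# (here, pairwise correlations) that might *suggest* hypotheses to test later.
print(df.corr())
```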

Skills for Big Data Analysis:

  • Managing the data: organizing data for analysis, which may involve buying tools or hiring data experts like data architects or chief data officers.
  • Understanding the data: using data science tools to extract intelligence from data, including statistics, machine learning, and data mining.
  • Acting on the data: applying insights to make managerial decisions, requiring data skills and domain expertise.

Data Skills for Managers:

  • Interpreting and understanding data analysis
  • Challenging data analysis when necessary
  • Understanding the limitations of data analysis

Domain Expertise for Managers:

  • Asking the right questions to turn data insights into action
  • Having relevant domain expertise to marry data insights with managerial action

Tools for Big Data Analysis:

  • Data management tools: collecting and organizing data within the company
  • Data analysis tools: extracting meaningful, managerially actionable intelligence from data

The lecture emphasizes the need for new skills and tools in big data analysis, including data management, data science, and domain expertise. It also highlights the importance of managers having both data skills and domain expertise to effectively act on data insights.

In this lecture, we’ll talk about what
makes big data analysis different from traditional data analysis. And in turn what does that mean in
terms of the kinds of skills and tools companies need
within their organization. Now, traditional data analysis
is often very structured. It might start with a managerial
question which might result in a hypothesis that is posed by
a statistician or a data scientist. The goal then is to analyze the data
in order to test that hypothesis. The data analysis either confirms our hypothesis or
suggests that it is incorrect. In short,
it is very much hypothesis driven. In contrast, big data
analytics is far more exploratory. It starts by looking at data, not
necessarily with a specific hypothesis, but with a broad set
of business questions. We might conduct more
exploratory analysis and find certain patterns or relationships or
correlations in our data. That might suggest certain insights,
business insights. And sometimes that might in fact
even lead to certain hypotheses. And we might then conduct more
formal hypothesis testing or traditional analysis on that. In short, big data analysis is about being
more iterative, being more exploratory. And essentially it’s a process where data
often leads the way as opposed to our hypothesis. Big data analysis also needs
a new set of skills or capabilities within the organization. I tend to think of these skills in
terms of three main types of skills. Managing the data, understanding the data,
and acting on the data. Managing the data is all
about organizing data so that it can be analyzed subsequently. Sometimes this involves
buying third-party tools, and often it's these tool developers that
focus on how best to manage data. And they provide us nice solutions
that we can buy off the shelf. But sometimes within a company, we also need data experts who can
help manage the data internally. These might be in the form
of data architects or chief data officers, who might
set data governance policies and who might also figure out the architecture
of how our data is going to be organized, either on premises or in the cloud. Understanding the data is all about using
tools to extract intelligence from data. This is broadly the domain
of data science. It includes statisticians who often
conduct traditional data analysis. It also includes machine learning and
data mining experts, who might apply more modern
techniques from computer science in order to analyze the data. It also includes the ability
to visualize data. Because often one of the key
abilities in data science is not just being able to analyze the data, but also being able
to construct stories and visualize it in meaningful ways. So that the insights can be easily
consumed by all stakeholders. The third set of skills
is acting on the data. This is where managers come in. It requires managers to take
insights from the data analysis and apply them to make
managerial decisions. It requires two kinds of skills,
the first is data skills. Managers need to be able to interpret and understand what data
scientists are telling them. They need to be able to challenge the data
analysis when appropriate because data insights can also be misleading. Sometimes data analysis can
find spurious correlations. And acting on them without really
challenging the data analysis can be problematic. So it really does require managers to have
a basic understanding of data science: to understand the
limitations of data analysis, and to appreciate when data
analysis is correct and when it needs to be modified or
challenged appropriately. But the second related skill that
managers need is domain expertise. Data is often telling us what
patterns we saw in the past. But it requires managers who
have relevant domain expertise to ask the right questions and figure
out how to go from data insights to managerial action. So the most successful
managers in a data world are managers who
are simultaneously data-savvy and also have strong vertical
domain expertise. They can marry the two together
to drive managerial action. Lastly, big data analysis not only
requires new sets of skills within the organization. It also requires new sets of
tools within the organization. In terms of tools we can think
about two kinds of tools. The first is data management tools,
which is essentially about tools that help us collect and
organize all the data within the company. And the second is data analysis tools. These are tools that help
us analyze the data and extract meaningful managerially
actionable intelligence from that data. In the next lecture, we will dive
into the data management tools.

Video: Data Management Tools

Data Warehousing

  • A Data Warehouse is a specialized database management system that stores historic data from multiple sources in an enterprise.
  • Its purpose is to provide a single point of access for all data in the company, serving analytics needs.
  • Examples of Data Warehouses include Microsoft's Azure SQL Data Warehouse, Google BigQuery, Snowflake, and Amazon Redshift.

How Data Warehouses Work

  • Operational data is pulled from various sources (e.g., CRM, ERP, billing systems) using ETL (Extract, Transform, Load) tools.
  • Data is transformed and loaded into the Data Warehouse, providing a unified view of all data in the company.
  • Reporting and data visualization tools, such as Tableau, can be built on top of the Data Warehouse.
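
As a rough sketch of that ETL flow, here is a minimal Python example using sqlite3 and pandas. The source systems, tables, and columns are hypothetical stand-ins for a CRM and a billing system, not any particular vendor's schema; real pipelines would use dedicated ETL tools.

```python
import sqlite3
import pandas as pd

# Hypothetical source systems, stood up in memory so the sketch is runnable;
# in practice these would be a live CRM and a billing system.
crm = sqlite3.connect(":memory:")
billing = sqlite3.connect(":memory:")
pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bo"],
              "city": ["Austin", "Boston"]}).to_sql("customers", crm, index=False)
pd.DataFrame({"customer_id": [1, 1, 2], "product_line": ["A", "B", "A"],
              "amount": [120.0, 80.0, 200.0],
              "invoice_date": ["2024-01-05", "2024-02-11", "2024-01-20"]}
             ).to_sql("invoices", billing, index=False)

# Extract: pull the operational data out of each source system.
customers = pd.read_sql("SELECT * FROM customers", crm)
invoices = pd.read_sql("SELECT * FROM invoices", billing)

# Transform: join the sources into one unified view and clean up types.
unified = customers.merge(invoices, on="customer_id")
unified["invoice_date"] = pd.to_datetime(unified["invoice_date"])

# Load: write the unified view into the warehouse (a local file here; in
# practice this would be Snowflake, BigQuery, Redshift, and so on).
warehouse = sqlite3.connect("warehouse.db")
unified.to_sql("customer_invoices", warehouse, if_exists="replace", index=False)
```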

Value of a Data Warehouse

  • Provides a single point of access for all data in the company.
  • Serves as a repository for historical data, separating operations from analytics.
  • Ensures data quality and provides a comprehensive view of data for analytics queries.
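
For example, the kind of bird's-eye analytics query a warehouse exists to serve, such as revenue by product line broken out by month and city over years of history, might look like the sketch below. It assumes the hypothetical customer_invoices table loaded in the ETL sketch above.

```python
import sqlite3
import pandas as pd

warehouse = sqlite3.connect("warehouse.db")

# An analytics query touches years of history across the whole company,
# unlike an operational lookup that fetches one customer's current balance.
revenue = pd.read_sql("""
    SELECT product_line,
           strftime('%Y-%m', invoice_date) AS month,
           city,
           SUM(amount) AS revenue
    FROM customer_invoices
    WHERE invoice_date >= date('now', '-10 years')
    GROUP BY product_line, month, city
""", warehouse)
print(revenue.head())
```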

Big Data Tools

  • Hadoop and Spark are big data tools that serve two main purposes: storage and processing.
  • They store massive amounts of data in a distributed fashion across multiple computers or nodes.
  • They process data in a distributed and parallelized manner, increasing speed.

Hadoop and Spark

  • Hadoop is an open-source tool offered by the Apache Foundation, with Cloudera being a popular distribution.
  • Spark is a more recent and dominant replacement for Hadoop, solving some of its limitations.
  • Databricks is a company built around Spark.
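
To give a feel for the distributed processing these tools provide, here is a minimal PySpark sketch (PySpark is Spark's Python API; the file and column names are hypothetical). Spark partitions the data across nodes and parallelizes the aggregation automatically.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or connect to) a Spark session; on a real cluster this would point
# at many machines, and the work below would be spread across them.
spark = SparkSession.builder.appName("transactions").getOrCreate()

# Read a large, hypothetical transaction file; Spark partitions it across nodes.
tx = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# A distributed aggregation: total spend per customer, computed in parallel
# on each partition and then combined.
totals = tx.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))
totals.show(5)
```

The same code runs unchanged on a laptop or on a large cluster; that portability is a large part of Spark's appeal.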

Before your company can use your data and begin
with AI initiatives, it is important to have the
data infrastructure in place. In this lecture, I’m
going to talk about data management tools
that are necessary for companies to have in
place before they can embark on large-scale
AI initiatives. First, we’ll talk a little
bit about Data Warehousing. To begin, many of you are likely familiar with the
concept of a database. A database is quite simply a
structured collection of data. Even an
Excel spreadsheet can be thought of as
a type of database. Now in practice, we usually need better tools
to manage data. Database management
systems, or DBMSs, are systems that allow users to better access and
manage the database. Excel again provides some
simple functionality, but more advanced databases
from Microsoft and Oracle and many other companies really help companies better
manage their data. Sometimes we refer to database management systems
quite simply as a database. A Data Warehouse is a particular database
management system. It's specialized in two ways. First, it's specialized
in terms of the type of data a Data
Warehouse stores. Usually that is historic data from many sources
in the enterprise. A Data Warehouse is also specialized in terms of
the purpose it serves, and that is Analytics. A usual database might
serve operations. For example, when a customer of a bank logs into
the website and wants to look up their
current account information, that customer is interacting with
an operational database that is able to pull
data very fast and respond to customer queries
like their current balance. In contrast, analytics
needs access to all of the data that a company might
have or most of it. The purpose there is
usually not speed, but it’s the ability
to have a more comprehensive and more of a bird’s eye view of all
the data in the company. A Data Warehouse
serves that purpose. It’s not necessarily
the fastest database, but it’s specialized for the
function of Analytics and thus provides a more
complete picture of the data in an organization. Examples of Data
Warehouses include Microsoft’s Azure SQL Warehouse, google BigQuery, Snowflake,
and Amazon Redshift. Now let’s talk a little bit about how Data Warehouses work. Usually in most companies, operational data is sitting
in many different places. For example, customer data might be sitting
in a CRM system. Some other enterprise
information, including information
about partners and supply chain may be
sitting in an ERP system. Customer billing information
might be sitting in another separate database. Now if we want a unified view of all the data in the company, we first need to pull all of that data into
a Data Warehouse. ETL tools are useful for that. ETL stands for extract,
transform, and load. These tools pull the data out of the different
individual databases. For example, they’ll pull the customer data out
of the CRM system, the customer’s billing data out of the billing
system and so on. All of that data is pulled out, it’s transformed as needed and then loaded into
the Data Warehouse. Popular ETL tools
include tools built by companies like
Informatica and Stitch, which is now part
of a company called Talend, and many others. The Data Warehouse
now has all of the data from all these
different sources. Once we have this
data in one place, you can now build
Reporting and data visualization tools
on top of that. For example, business
intelligence tools like Tableau sit on top of
the Data Warehouse. When an analyst enters a query, these systems can then go into the Data Warehouse and pull
the necessary information. Next, let’s talk about the
value of a Data Warehouse. The main purpose or value of
a Data Warehouse is that it serves as a single point of access for all data
in the company. It’s where a history
of all the data is stored and as I
mentioned earlier, a Data Warehouse helps separate operations
from Analytics. Usually the operations data is made to be fast so that
when a customer logs in, then you can pull the
data fast and respond with information such as the
customer’s balance. On the other hand,
certain Analytics queries might require a more comprehensive access to historical data and an
assurance of data quality. For example, an
analyst might want to know how much revenue each
product line has brought in over the last 10 years, with
that data broken out by month
and by city and state. That query requires
access to a lot of historical data over
the last ten years and the Data Warehouse
provides that assurance of Data Quality and
that single point of access to all of that data. Now, that’s a little bit
about Data Warehouses. As part of Data Infrastructure. We should also talk about big data tools such
as Hadoop and Spark. Now Big Data tools
like Hadoop serve two main purposes,
storage and processing. Now Storage of big data usually has some
unique challenges. If we want to store a
little bit of data, a few files, we can typically store that in our computers. But what if there’s
massive amounts of data, data for millions or
hundreds of millions of customers over the
last 10, 20 years. That kind of data cannot be
stored in a single computer. One of the things that a Big Data tool like
Hadoop does is store the data in a
distributed fashion across multiple computers
or multiple nodes. Next, these systems also take care of
processing that data. Usually that processing again involves distributed processing of that data across
multiple nodes or across multiple machines, and parallelizing
the computations or data processing
as much as possible, which helps increase speed. Hadoop is an open source tool that is offered by the
Apache Software Foundation, which is a non-profit foundation that provides open
source software. The most popular distribution of Hadoop is by a company
called Cloudera, although there are
several others. Spark is a more recent version, or in fact, I would say, a more dominant
replacement for Hadoop, which serves similar
purposes but solves some of the problems that Hadoop
had faced in the past. Databricks is the
most dominant company that is built around Spark. We’ll next talk a
little more about Data Warehouses and
also Big Data tools like Hadoop and Spark in our discussion with an
executive from Snowflake.

Video: Data Management Infrastructure

Main Topic: Data Infrastructure for AI-Driven Business Transformation

Guest: Chris Child, Director of Product Management at Snowflake

Key Points:

  1. Companies need two types of databases: transactional databases for day-to-day operations and analytic databases for processing large sets of data over a long period of time.
  2. The evolution of data infrastructure has moved from custom-built systems to Hadoop and now to cloud data warehousing.
  3. A data infrastructure includes a set of ingest tools, transformations, and query and visualization engines.
  4. Companies should think ahead of time about what problems they want to solve with data and what data they need to answer those questions.
  5. Having a data infrastructure in place is crucial before using machine learning or predictive technologies.

Actionable Advice:

  • Think carefully about the types of questions you wish you could ask, but can’t because you don’t have all of the data.
  • Identify the types of questions you are answering today, but it’s taking a long time.
  • Think about what data you need in order to answer those questions.
  • Focus on collecting the most important pieces of data that are critical to your business.

Welcome back. In
this session we’re going to talk about the
data infrastructure that companies need to have in
place before they can embark on large-scale AI driven
business transformation. To help us understand
what kinds of data infrastructure companies
need to have in place, we have with us Chris Child, who is a Director of Product
Management at Snowflake. Chris, welcome and please tell us a little bit about
your background in the space. Thanks Kartik,
excited to be here. As you mentioned, I work
at Snowflake Computing, which is a Cloud data
warehouse company. I’ve spent my career
working in data, both as an investor and
now as an operator. Helping build systems
that help companies make better decisions and really
run their businesses better. Chris, I think a good
place for us to begin is to talk about what
exactly is a database? How exactly are companies thinking about all the
different kinds of databases? And what has the evolution been over the last several years? Sure. There's
really two types of databases that most
companies end up needing. The first is what we call
a transactional database. This is a system
that keeps track of the important information that’s running your business
on a day-to-day basis. If we take a bank, for example, a bank would have a transactional
database that keeps track of the balances
for all the customers, and you’d use that every time someone starts a transaction
to figure out if there’s enough money
in their account to debit their account or credit their account and keep the running
balance going on. These are very useful, and
they need to be very fast, and they tend to
be very expensive. On the other hand, you
end up with what’s called an analytic database
or an analytic system to process
much larger sets of data over a long
period of time. Continuing with
the bank example, I might want to keep a
history of every transaction and every balance that each
of my customers has ever had. It would be very
expensive to keep this in my transactional database. So I move that data to
a separate database, to an analytic
database where I can keep these massive
long histories. Then I can ask questions like, “I'd like a list of all of my customers whose balance grew by at least 10 percent in
four of the last five years.” My transactional database
won’t be able to answer that, but my analytical database will. To transition to
analytics databases clearly would involve
investments in infrastructure. What kind of infrastructure
are we talking about? When you originally set up a data warehouse or an
analytics database, you needed to buy
special hardware. You needed to buy very
expensive special software from a variety of
different vendors. Again, we're talking
about 20 or 30 years ago, which is when this methodology of storing your data
came into play. Then the amount of data
that people were collecting grew, coming from lots
of different sources, whether that be from mobile
apps, from websites, from marketing campaigns,
or even from data you're collecting physically in your store about
what's happening, or from your entire
supply chain. There were a lot of
different sources of data that started coming in. Those types of specialized
analytics databases that ran on special hardware became very expensive to operate and started
not really being able to keep up with
the performance needs of that massive
amount of data. That's when we went
through the first big evolution of this. From these custom-built,
specialized analytics data warehouses, the massive amounts of
data started getting stored in a new
system called Hadoop. Hadoop was based on systems Google developed
to process the massive amounts of web data that they collect and track. It was also designed to run on a giant network of very
inexpensive hardware. Instead of these specialized, very expensive
servers, you could run on hundreds of
very cheap servers. The result was this was a much more
cost-effective way to manage and process these
massive amounts of data. Now, isn't that really what creating a data lake is all about? Also, aren't we in a process
of seeing many companies transition to newer
Big Data tools or technologies like
Spark and others? Absolutely. The data lake
is what people use to refer to basically massive sets of hard drives that they’re
storing all of this data in. It’s a place you can pour huge amounts of
data like a lake, and then you can use tools
like Hadoop or now Spark, which is a much
more modern version of the Hadoop
computation engine, to pull data out of that lake, run some calculations
and transformations on it and then put it back so
that you can find it later. Traditionally, what we
saw is once people would sort of finish all
those calculations, they wanted to be able to
query that data very quickly. They would end up
putting that into those data warehouses that
they were using originally. Now they started to refer
to those as data marts, which was where a small set
of your customers or of your internal users could go
get a subset of the data. But in order to get a
new data set loaded, you had to go back and write
Hadoop or Spark jobs and get that data transformed
and loaded into those data marts or
data warehouses. For those of you who are not
familiar, Hadoop and Spark, these are techniques for storing and processing
large amounts of data. Essentially they involve
distributed storage, distributed processing
of the data, and creating a lot of
parallelization, which helps the data processing
happen faster. Now, coming back, Chris, we've also now seen a transition towards
Cloud data warehousing. Can you explain what exactly
a Cloud data warehouse is? How does it fit within this whole conversation
of companies moving the data to data
lakes and data marts? Absolutely. A lot of people found that with these
data lakes and data marts, it was still hard to keep
track of all of your data. It was in different places, there were massive amounts of it, and it was in inconsistent formats. Accessing it often involved having your engineering
team actually write code that could run on these large parallelized systems. About 10 years ago, a lot of research
started happening into what are now called
cloud data warehouses. These are systems from Amazon or Google
or from Snowflake, which are a re-imagining of the traditional
data warehouse. They’re designed to run on massively parallel sets of inexpensive hardware,
like Hadoop. Generally, they’re run on
hardware that you rent from cloud providers like Amazon
or Google or Microsoft, instead of having to manage
those servers yourself. But from the outside, they look and operate and have the performance of a
traditional data warehouse. What that means is they use a language to speak
to them called SQL, which is what the data
warehouses and databases use. This means you can natively use Tableau or Looker
or other analytics and BI tools right
on top of them. Because they use that
standard language, they also integrate well
with large sets of tools. As we were talking a
little bit about before, what you really want
from your data platform overall is somewhere to
store all this data. You need a set of ingest tools. How do you get the data
into your data platform? Being SQL based,
you can use any of a wide variety of tools that are built specifically for that. You then need a set
of transformations that take the raw
data that's coming in and turn it into
something useful. As I’m sure you’ve talked
about in this class, one of those techniques
is machine learning that you can use to
take this raw data and score it and make predictions and figure out
what's going to happen. But there are also simple things: I might be getting data about the set of actions that users are taking on a daily basis. Really, I want to look at
that on a monthly basis. One transformation would be rolling that up to
a monthly basis. The final piece you need is a query and
Visualization Engine, as we mentioned,
Tableau or Looker, other tools like that. A way to actually run queries and for your analyst
team to build dashboards and basically ask questions of the data once it’s
been transformed. One of the big challenges
that people had with Hadoop or even with the
Spark based ecosystem, is that those tools often need to be custom-built
for that ecosystem. Whereas, if you use a
Cloud data warehouse, you get high-performance, you get the
scalability of Hadoop, but you also get access to the standard ecosystem of tools. Chris, when we started
our conversation, I mentioned that before
companies can start using machine learning or other
predictive technologies, they need to have a data
infrastructure in place. Now, putting this data
infrastructure in place obviously costs some money and
cannot be taken lightly. What questions should a manager ask before they embark
on such an exercise? Absolutely. One of the mistakes that I’ve seen people
make repeatedly is to think that having this Data Infrastructure
in and of itself is an important
thing to do. They'll set this up and
they'll load a bunch of data and they'll buy
a bunch of tools, and then they won’t actually
get any value out of it. Because what they didn’t
do was think ahead of time about what they
were trying to solve, what problems they had that they wanted to solve with data. What I would suggest
as anyone who’s going to undertake this journey, think first carefully about the types of questions that
you wish you could ask, but you can’t because you
don't have all of the data. Also, the types of questions that
you are answering today, but where it's taking a long time. An example of that
is anything where you ask someone on
your team to go spend two weeks collecting data and running
analysis in Excel. Those are decent candidates for the types of
problems that you could solve in minutes if you have the correct data
infrastructure in place. Finally, think about
what data you need in order to answer
those questions. It’s generally not
that useful to go collect every single piece of data you can possibly think of. Instead, what are the pieces of data that are important
to your business and are going to help you answer those critical business
questions so that you can run your business better
and more efficiently? Really, at the end of the
day, that's the whole goal. Chris, that has been a very
helpful, in-depth overview of the data infrastructure companies
need to think through. Thank you so much
for joining us. Thank you, Kartik. I appreciate it.
Thanks for having me.
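
One transformation Chris mentions, rolling daily user actions up to a monthly view, is simple to picture in code. Here is a minimal pandas sketch with invented column names:

```python
import pandas as pd

# Hypothetical raw event data: one row per user per day of activity.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-03", "2024-01-19", "2024-01-20",
                            "2024-02-02", "2024-02-11", "2024-02-25"]),
    "actions": [5, 3, 7, 2, 4, 6],
})

# The transformation: roll daily actions up to a monthly total per user.
monthly = (events
           .groupby(["user_id", events["date"].dt.to_period("M")])["actions"]
           .sum()
           .reset_index(name="monthly_actions"))
print(monthly)
```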

Video: Data Analysis: Extracting Intelligence from Big Data

Data Mining vs. Traditional Statistical Regression

  • Data mining is a broad term that refers to tools for discovering patterns in large data sets
  • Traditional statistical regression starts with a hypothesis, whereas data mining is a data-driven exploration that may not start with a hypothesis
  • Data mining techniques include clustering and association rule mining
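
To make that contrast concrete, here is a hedged sketch of the traditional, hypothesis-driven side: a logistic regression testing whether hypothesized factors (prior default, number of cards, employment) predict default, as in the lecture below. The data and coefficients are synthetic.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for historical credit card records.
prior_default = rng.integers(0, 2, n)
num_cards = rng.integers(1, 8, n)
employed = rng.integers(0, 2, n)
log_odds = -1.5 + 0.9 * prior_default + 0.2 * num_cards - 0.8 * employed
default = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Hypothesis-driven: the analyst chose these three variables up front, and
# the regression estimates whether each one matters and by how much.
X = sm.add_constant(np.column_stack([prior_default, num_cards, employed]))
model = sm.Logit(default, X).fit(disp=False)
print(model.params)   # estimated coefficients (cf. the 0.93 in the lecture)
print(model.pvalues)  # whether each hypothesized factor matters
```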

Clustering

  • Clustering is a data mining technique that groups similar data points together
  • Clustering can be used to determine customer segments in a data-driven manner
  • Example: clustering can be used to identify purchasing patterns of suburban soccer moms and compare them to other customer segments
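
A minimal scikit-learn sketch of this data-driven segmentation follows; the features and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: annual spend, visits per month, basket size.
rng = np.random.default_rng(1)
customers = np.column_stack([
    rng.normal(2000, 600, 300),  # annual spend
    rng.normal(6, 2, 300),       # visits per month
    rng.normal(35, 10, 300),     # average basket size
])

# Scale the features, then let the algorithm find three segments in the data,
# rather than defining the segments from gut feel.
X = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))   # customers per segment
print(kmeans.cluster_centers_)       # each segment's typical (scaled) profile
```

In practice, an analyst would try several values of n_clusters and inspect each segment's profile before attaching a business interpretation to it.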

Association Rule Mining

  • Association rule mining is a data mining technique that finds common co-occurrences in data
  • Example: analyzing shopping cart data to find patterns, such as people who buy bread and butter also tend to buy milk
  • Another example: Don Swanson’s analysis of Raynaud’s disease, which found that EPA (eicosapentaenoic acid) is associated with reducing blood viscosity and strengthening the musculoskeletal system, leading to a hypothesis that EPA can help treat Raynaud’s disease
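
The bread-butter-milk example can be sketched directly: count how often itemsets co-occur across baskets and compute the rule's support and confidence. A pure-Python sketch with made-up baskets (production tools automate this search over all itemsets rather than checking one rule):

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

antecedent = {"bread", "butter"}
consequent = {"milk"}

# Count baskets containing the antecedent, and those containing both.
both = sum(1 for b in baskets if antecedent | consequent <= b)
ante = sum(1 for b in baskets if antecedent <= b)

# Support: how often the full pattern appears overall. Confidence: given
# bread and butter in the cart, how often is milk there too?
support = both / len(baskets)
confidence = both / ante
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```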

Predictive Analytics

  • Predictive analytics involves using data to make predictions about the future and take action based on those predictions
  • Examples of predictive analytics include:
    • Predicting demand for a product and making production decisions based on that
    • Predicting whether a transaction is fraudulent or not
    • Recommending products to customers based on their browsing and purchasing history
    • Detecting fraudulent transactions in real-time
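
And a hedged sketch of the fraud-detection case: train a classifier on labeled historical transactions, then score a new swipe on the fly. The features, thresholds, and data are all synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 2000

# Synthetic historical transactions: amount, hour of day, miles from home.
X = np.column_stack([
    rng.exponential(80, n),   # amount in dollars
    rng.integers(0, 24, n),   # hour of day
    rng.exponential(10, n),   # distance from home, in miles
])
# Synthetic labels: fraud here skews toward large, far-from-home charges.
y = ((X[:, 0] > 150) & (X[:, 2] > 15) & (rng.random(n) < 0.8)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score a new swipe in real time: a $950 charge at 2 a.m., 400 miles from home.
new_tx = np.array([[950.0, 2, 400.0]])
print(f"fraud probability: {model.predict_proba(new_tx)[0, 1]:.2f}")
```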

In this lecture, we’re going
to talk about data analysis. In particular, we’ll begin
by discussing data mining. Now, data mining is a
broad term that refers to tools for discovering
patterns in large data sets. To understand what
exactly is data mining, it's useful to contrast data mining with something
many of you understand, which is simple
statistical regressions. Now, when we are
conducting regressions, we might start off
with a hypothesis. For example, we’re trying to understand what are the
factors that predict whether a customer might default and not pay
their credit card dues. We might come up with the
hypothesis that the risk of default depends on a
number of factors, such as whether they’ve
defaulted in the past. That is, we have a hypothesis
that people who have defaulted previously are
likely to default again. We might also have a hypothesis that people who have
a large number of credit cards are
likely to default because they are
perhaps struggling to manage their finances. Lastly, we might have
a hypothesis that people who are employed might
be less likely to default. Now the goal of the
regression might be to test these hypotheses. We might run a regression
based on past data, where we test what is
the risk of a person defaulting and whether it
depends on these factors, meaning prior default, number of credit cards, and whether they’re
employed or not. The regression tells us whether these factors
matter or not, and it also tells us
how much it matters. On the slide, you
see that whether a person defaulted
in the past or not has an impact on whether they default
again in the future, and the regression coefficient 0.93 tells us how
important it is. Now notice that all these
important variables, like number of credit cards, whether they’re employed or not, whether they’ve defaulted
in the past or not, these came from a
hypothesis the analyst had. That is at the heart of
traditional data analysis, that is at the heart of
regression testing as well. Data mining, in contrast, is more about
data-driven exploration. It may not start
with a hypothesis, as I previously mentioned. There are a number of
different techniques that are part of data mining. In fact, data mining is a catch-all term for a
number of these techniques. I will not go over all of the techniques that are
part of data mining because really there are a very large set of
such techniques. But I’ll go over a
couple useful examples. The first one is clustering. Clustering is a data
mining technique that is used to group our data. Essentially, clustering
will break up our data into a bunch of
smaller groups or clusters, such that data points within a cluster are similar
to each other, and data points in
different clusters are different from each other. A classic application of
clustering might be in determining customer
segments in our data. The old way of doing
customer segmentation might be from your gut. A marketing manager might say
based on their experience, that we have three kinds
of customer segments, and they might describe
these customer segments in terms of customer
demographics. For example, they might say one of those
customer segments is soccer moms in families of four or five people who live in suburban areas, and that might be how they articulate what one of their
customer segments is like. In contrast, when
we use clustering, we're trying to figure
out the customer segments in a data-driven manner
without this hypothesis, and clustering might
either validate the gut of the manager
and might indeed indicate and show that the
purchasing patterns of suburban soccer moms are different from the purchase
patterns of other customers. Or it might suggest that the
differences are not that important and maybe there’s a different way we should be thinking about
customer segments. Another data mining tool is
association rule mining. Association rule mining is a data mining technique that finds common co-occurrences
in the data. For example, we might analyze shopping cart data or purchase patterns of
customers at a grocery store, and we might look at
common patterns in there. Association rule
mining software might find a pattern such as people who tend to buy
bread and butter in a transaction also tend to buy milk in that
same transaction. If we find this pattern, we might take
action based on it. For example, a traditional
grocery store meaning a brick and mortar grocery store might decide to stock bread, butter, and milk close by. Or an online grocery store might decide that if a customer has already added bread and butter to their shopping cart, then it’s going to
make a recommendation to the customer
to also add milk. There are many applications of association rule
mining techniques in business data in order to
find patterns in those data. Another example might be
applications in health care. One example that comes to
my mind is an analysis of Raynaud’s disease that was
done by a computer scientist or information scientist by
the name of Don Swanson. Don Swanson was interested in
studying Raynaud’s disease, which is a syndrome that affects the
musculoskeletal system. He was in particular
interested in identifying novel treatments
for Raynaud’s disease. Because at that
time there were not very many known treatments for Raynaud’s disease
or Raynaud’s syndrome. In order to answer that, Don Swanson looked
at a number of research papers on
Raynaud’s disease and found what kinds of concepts are associated with
Raynaud’s disease. In other words, what are the common co-occurrences with
the term Raynaud’s disease? He found that blood
viscosity is a term that often co-occurs with a
discussion of Raynaud’s disease. He also found that
musculoskeletal issues are often discussed in articles that talk about
Raynaud’s disease. For example, he found that articles that talk about
Raynaud’s phenomenon or Raynaud’s syndrome talk about an increase of blood viscosity
during Raynaud’s syndrome. Next, he asked what kinds
of other concepts are commonly co-occurring
with ideas such as blood viscosity and
musculoskeletal weakness. He found one concept which is EPA or eicosapentaenoic acid, which was commonly discussed along with blood
viscosity, along with musculoskeletal weakness, and along with a number of ideas that are associated
with Raynaud’s disease. For example, he found
phrases such as EPA or eicosapentaenoic acid
helps reduce blood viscosity. In contrast, Raynaud’s disease
increases blood viscosity. EPA is also associated with strengthening the
musculoskeletal system. In contrast,
Raynaud’s disease is associated with weakening of
the musculoskeletal system. Based on this, Don
Swanson came up with a hypothesis that EPA, which is found
commonly in fish oil, can help treat
Raynaud’s disease. Indeed, later clinical
trials showed that fish oil is an effective
treatment for Raynaud’s disease. Now I should clarify here that Don Swanson did not use
association rule mining software. Instead, he used the same
idea and did it manually. But in his later research, he talked about
how his scientific process could perhaps be
automated using tools that are finding common
co-occurrences in data. This is what is at the heart of association rule
mining software. Now data mining techniques, such as clustering and
association rule mining, ultimately are about
finding patterns in data. The next step beyond just
finding patterns is to perhaps make predictions about the future and take
action from it. For example, can we
predict demand for our product in the
future and figure out production decisions
based on that? Can we predict whether a transaction that just
happened is fraudulent or not? That is where the domain of predictive analytics comes in. Let’s look at a
couple examples of what we can do with
predictive analytics. Let’s look at a large retail
company, such as Amazon. A customer might
visit the website. They might look at or browse a few products. They might eventually pay for these products, and the items
are shipped by Amazon. Now the goal of
the retailer is to convince the customer
to buy the product. Often, retailers like Amazon will show recommendations
to the consumer. For example, recommendations such as people who
bought this also bought this or people who viewed this product also
viewed that product. At the heart of these
recommendations is an attempt to figure out what kinds of products this customer might be interested in, ultimately hoping to convince the customer to buy a product. That's an example of a predictive analytics
application that is trying to predict what kinds of product a customer
might be interested in. Another example might be that when a customer is
ready to buy the product, they might enter their
credit card information and hit Purchase now or Buy now. At this point, algorithms at the retailer’s website
have to figure out whether this is a
legitimate transaction or not. In particular, whether
the credit card is a legitimate credit card that is owned by this
customer who’s placing the order or is it likely
that this is stolen? Here, predictive analytics techniques look at past data and try to predict whether this transaction is fraudulent or not.
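Here is a minimal sketch of that fraud-scoring idea in Python. The features (transaction amount, foreign merchant, time since last purchase) and the synthetic data are assumptions for illustration, not a description of any retailer's actual system.

```python
# Train a classifier on labeled past transactions, then score a new one.
# Everything here is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.exponential(80, n),   # transaction amount in dollars
    rng.integers(0, 2, n),    # foreign merchant (1) or not (0)
    rng.exponential(24, n),   # hours since the customer's last purchase
])
# Assume fraud is more likely for large purchases at foreign merchants
p = 1 / (1 + np.exp(-(0.005 * X[:, 0] + 1.5 * X[:, 1] - 4)))
y = (rng.random(n) < p).astype(int)   # 1 = fraudulent, 0 = legitimate

model = LogisticRegression(max_iter=1000).fit(X, y)

new_txn = [[950.0, 1, 2.0]]   # a large purchase at a foreign merchant
print(f"estimated fraud probability: {model.predict_proba(new_txn)[0, 1]:.2f}")
```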
Ultimately, this is just one example of predictive analytics in retail. Indeed, there are many
applications of these approaches. In the next module, we will look at these predictive
analytics techniques. In particular, we’ll look
at machine learning as a tool to make predictions that are
managerially actionable.

Video: Introduction to Artificial Intelligence

Artificial Intelligence (AI)

  • AI refers to the development of computer systems that can perform tasks that normally require human intelligence
  • There are different types of AI, including weak AI (narrow intelligence) and strong AI (artificial general intelligence)
  • The goal of the field is to build strong AI that can perform any intellectual task that a human can

History of AI

  • The field of AI traces its origins to Alan Turing, who proposed the Turing test to determine whether a machine can think
  • The first workshop on AI was organized by John McCarthy in 1956
  • The term “Artificial Intelligence” was coined by John McCarthy and has since become a widely recognized term in the field

Machine Learning

  • Machine learning is an alternative approach to building AI that involves giving computers the ability to learn from data
  • This approach is in contrast to traditional expert systems, which rely on explicit programming of knowledge from experts

Limitations of Expert Systems

  • Expert systems have limitations, such as Polanyi’s Paradox, which suggests that there is a lot of tacit knowledge that is not explicitly known
  • This limitation has led to the emergence of machine learning as an alternative approach to building AI

Examples of AI

  • IBM’s Deep Blue, which beat the world chess champion in 1997
  • IBM’s Watson, which beat human champions on Jeopardy! in 2011
  • Google’s AlphaGo, which beat the world Go champion in 2016

Machine Learning

  • Machine learning is a way to build AI that involves giving computers the ability to learn from data
  • This approach has been successful in building AI systems that can perform complex tasks, such as playing Go and diagnosing diseases.

In this module,
we’ll talk about artificial intelligence, we’ll begin with a brief overview of AI. We’ll then dive into a sub-field
of AI known as machine learning. We will start with the high level view
of what exactly is machine learning. And then, we’ll dive into some
specific machine learning methods. With that, let’s start by talking about
what exactly is artificial intelligence. Artificial intelligence or AI is a term
that refers to the development of computer systems that are able to perform tasks
that normally require human intelligence. Such as understanding language, reasoning,
speech recognition, decision making or navigating the visual world,
manipulating physical objects and such. When we talk about artificial
intelligence, there are many kinds of AI, for example,
one can think about weak AI and strong AI. Weak AI, also known as artificial
narrow intelligence is the kind of AI that is very
good at a very specific task. For example, you might have a
chess-playing AI that can probably beat the world’s best chess grandmaster,
but it is only good at that one task. The same AI probably
cannot converse with us, it probably cannot recognize images and
so on. Similarly, you might have AI that is
good at product recommendations, but is not good at chess or
recognizing images. In short, these are AI that
are good at one narrow task, most of the AI around
us tend to be weak AI. But the goal of the field is eventually
to build what is known as strong AI or artificial general intelligence. This is a computer program that could
do all intelligent things that a human can do. And so this kind of AI would
be truly intelligent and would be close to a human being
at a wide range of tasks. And finally, you have the notion
of artificial super intelligence. This is an AI system that is a strong AI: it’s as good as humans at a lot of tasks
but it has the ability to leverage its computational resources to store more
data, to analyze the data faster and make decisions faster and therefore
can perhaps beat humans at many tasks. And that is the idea of
super intelligence, or AI that is better than humans at most tasks. The history of AI is very recent; the
field owes its origins to a paper written by mathematician Alan Turing,
who asked the question, can machines think? Turing had the contention that machines could be constructed which could simulate the human mind very closely. In fact, he proposed a test which
is known as an imitation game, or also popularly known as the Turing
test for machine intelligence. In the test, a human judge interacts
with two computer terminals, one of the computer terminals
is controlled by a computer and the other terminal is
controlled by a human being. The judge interacts and has a conversation with each of
these through the computer terminal. If the judge cannot distinguish between
the human being and the computer system, then that computer system is said
to have passed the Turing test. Now, when Alan Turing proposed
the Turing test and posed the question, can machines think,
it created a lot of interest in the field. And it led to one of the first workshops
in the field which was a summer workshop on artificial intelligence that
was organized by mathematician, John McCarthy and was attended by
several other luminaries of the field. At this workshop, the scientists laid
the foundations for a field that became known as AI and in fact also coined
the term AI or artificial intelligence. Computer scientist Pedro Domingos believes that calling this field AI made it very ambitious, but it also helped
inspire many people to enter the field and that has been responsible for
a lot of progress that the field has made. Now, a lot of the early attention
in AI often was focused on whether AI could beat human beings at games. For example, in 1997, IBM created
a chess-playing computer called Deep Blue, which ended up beating the world
number one chess player at the time, Garry Kasparov, three and
a half points to two and a half points. This system had no machine learning in it, meaning it was not capable of learning
on its own without being programmed. Its edge relative to human players
came from its brute computing power, its ability to analyze more than
200 million positions per second and figure out the best possible move. In 2011, IBM created IBM Watson
which beat Ken Jennings and Brad Rutter who were two of the best
all-time players of Jeopardy. IBM’s Watson had machine learning in it; it was capable of understanding language, meaning it could understand the question being asked, retrieve information from a large database, and then answer the question that was posed to it. More recently, Google created software
known as AlphaGo to play the game of Go. Go is a strategy game like chess but
is much more complex than chess, which implies that brute computing power alone
is not sufficient to beat a human being. You require something more than brute computing power; you require the ability to learn, which makes Go a better yardstick for intelligence. Google used some of the latest machine
learning techniques in creating AlphaGo. And AlphaGo had great success
in playing human beings and in fact beat the World Go champion,
Lee Sedol. There are many ways to build
artificial intelligence, now, the old way of building AI is an approach
known as knowledge engineering or also now referred to as expert systems. This is the idea of programming
knowledge or capturing and transferring knowledge
to the computer system. For example, if we wanted to build
software to diagnose diseases, we might interview doctors and codify
the rules they use to diagnose diseases. For example,
a doctor might tell us that if a person or a patient has had fever for
over a week and they have body aches and chills, then they might start to
consider antibiotic treatment. Now, that’s a rule that they might give us, and we might program many such rules to diagnose diseases.
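To see how literal this programming of knowledge is, here is a toy sketch of one such hand-coded rule in Python; the symptoms and the one-week threshold are illustrative assumptions taken from the example above, not medical guidance.

```python
# A hand-coded expert-system rule: no learning, just knowledge that a
# doctor stated explicitly, translated into a program.
def consider_antibiotics(fever_days: int, body_aches: bool, chills: bool) -> bool:
    """Return True if the doctor's stated rule says to consider antibiotics."""
    return fever_days > 7 and body_aches and chills

print(consider_antibiotics(fever_days=9, body_aches=True, chills=True))  # True
print(consider_antibiotics(fever_days=3, body_aches=True, chills=True))  # False
```

A real expert system would chain together hundreds or thousands of such rules, and as we will see, tacit knowledge limits how far this approach can go.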
Similarly, if we wanted to build software to drive cars, we might interview thousands of drivers and ask them, what are the rules they use to drive cars? And they might give us rules such as
when the car in front of us slows down, we might apply the brake and
slow down ourselves. If the car in front of us is going very slowly, we might change lanes, and so on. Now, ultimately, we can create reasonably intelligent
systems using these kinds of techniques. And in fact, we have found over time
that expert systems do reasonably well. But over time, we have also observed
that expert systems are often not able to beat human beings at complex
tasks that require intelligence. For example, a system used to diagnose
diseases can do reasonably well, but it cannot often beat doctors in
terms of diagnosing diseases as well. This is because of a limitation that’s
referred to as Polanyi’s Paradox. Polanyi was a philosopher who came
up with the idea of tacit knowledge, which is the idea that we have a lot
of knowledge that we are not aware of. For example, when you ask a person, what are the rules
they use to drive a vehicle, they might be able to give us a number
of the rules that they can think of. And those rules are useful,
but at the same time, they’re not sufficient because there’s
a lot of knowledge we all have that we implicitly apply when we’re driving. But we’re simply not aware of
some of these principles that we apply while driving. And so as a result, asking people to
give us all the knowledge they have gets us a good amount of information,
but because of tacit knowledge, it doesn’t give us all the information. This is why an expert system to
diagnose diseases often cannot beat real world experts. This is why a driverless car created
using knowledge engineering or through an expert system approach ultimately
cannot drive as well as human beings. This has led to the emergence of
an alternative approach which is known as machine learning. Which is the idea that instead of
explicitly programming computers with knowledge from experts, we can instead
give them the ability to learn from data. And hopefully, they can observe
the action taken by experts and mimic that action over time. And that is what we will
turn to in the next lecture.

Video: Machine Learning Overview

The lecture discusses machine learning, a subfield of artificial intelligence that focuses on how computers can learn from data without being explicitly programmed. Machine learning is used for prediction tasks, such as predicting whether a transaction is fraudulent or not, determining whether an email is spam or not, and recognizing speech.

There are three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.

Supervised Learning: In supervised learning, the algorithm is trained on labeled data, where the input and output are clearly defined. The goal is to learn from the labeled data and make predictions on new, unseen data. Examples of supervised learning include image classification, speech recognition, and sentiment analysis.

Unsupervised Learning: In unsupervised learning, the algorithm is trained on unlabeled data, and the goal is to find patterns or structure in the data. Examples of unsupervised learning include clustering, anomaly detection, and topic modeling.

Reinforcement Learning: In reinforcement learning, the algorithm learns by taking actions and observing the consequences of those actions. The goal is to learn a policy that maximizes a reward signal. Examples of reinforcement learning include game playing, robotics, and autonomous vehicles.

The lecture also discusses the importance of high-quality data in machine learning, and how reinforcement learning can be used to balance exploration and exploitation in decision-making.

Some key points from the lecture include:

  • Machine learning is a subfield of artificial intelligence that focuses on how computers can learn from data.
  • There are three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
  • Supervised learning involves training on labeled data to make predictions on new data.
  • Unsupervised learning involves finding patterns or structure in unlabeled data.
  • Reinforcement learning involves learning from actions and consequences to maximize a reward signal.
  • High-quality data is essential for machine learning.
  • Reinforcement learning can be used to balance exploration and exploitation in decision-making.

Hi again. In this
lecture we’ll talk about machine learning and the different types
of machine learning. Machine learning,
as I mentioned, is a sub-field of
artificial intelligence. It’s mostly focused on how
do we get computers to learn from data without
explicitly programming them? These techniques are often used for prediction tasks. For example, we
might have data on past credit card transactions, and we might be interested
in predicting whether a new transaction is
fraudulent or not, so we might look at
past data in order to make this decision. Or we might be interested
in determining whether an email is spam or not
based on past data. We might be looking at a task of analyzing images for
a driverless car and figuring out whether
the object in front of a car is another
vehicle or a person, or a tree, or something else. We might be interested
in recognizing speech and understanding speech
like with Alexa or Siri. In short, there
are many kinds of prediction tasks that use machine learning and these have applications in a
variety of industries, ranging from
healthcare, to finance, to manufacturing, to human
resources, and so on. Now, it’s important
to understand that machine learning is not one single technique; there is really a large set of techniques, all of which come under the umbrella
of machine learning. In fact, there are many
types of machine learning. For example, one way to think of machine learning is in terms
of supervised techniques, unsupervised techniques, and reinforcement
learning techniques. Supervised learning
is the idea of building a predictive
model based on past data, and these data have clearly labeled input
and output data. For example, we
might have data on emails in the past and nice and clear labels on which
of those past emails are spam emails and
which ones are not. We might then want
to learn from it. This is a classification task
which is using past data, which has nice
labels of inputs and outputs to learn how
to label future data. Unsupervised techniques,
in contrast, have a lot of input data but you don’t have clear
labels on output, and so these techniques are finding patterns
in the input data. For example, you might
have anomaly detection, which is the idea of finding certain data points that look like anomalies or
in other words, they look different from
all other data in there. Similarly, we talked about
clustering, previously, which is the idea of grouping a set of data points
into different groups, such that data points within a group are as similar to each other as possible, and data points in different groups are as different from each other as possible. This is based on data, but we don’t have
clearly labeled output that is guiding us on how best to actually break up the data into
different clusters. Lastly, we have
reinforcement learning, which is the idea of
having a machine learning system acquire new data by taking actions and looking at the data to learn and
improve its future actions. We will look at each of these techniques in greater detail. Let’s start with
supervised learning. As I mentioned,
supervised learning is the idea of
learning from data, where you have cleanly labeled
output and labeled input. The inputs can be referred to as features or as covariates, and the outputs are often
called targets of the model. This is what we’re
trying to predict. For example, as I mentioned, we have email data and
the output that we’re trying to predict is whether
an email is spam or not. The inputs, or the features, or the covariates, are the
actual text in the email. With supervised learning,
the idea is that we have cleanly labeled past data
which have a correct answer, meaning that certain data
have been labeled as spam and certain
other data have been labeled as not being
spam and now we need to learn how to classify
future emails. Similarly, you might
have a desire to predict sales next week based
on historical data. We might use data on the season, the month of the
year, the weather, and other such patterns
to predict future sales. Our training data is actually past data which has all these patterns, month, season, weather, and also the actual sales that were realized in the past. Now we’re trying to make predictions about the future based on that.
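As a small sketch of this idea, the Python snippet below fits a regression on invented historical data with month, temperature, and promotion features and then forecasts a future week; all the numbers are assumptions for illustration.

```python
# A minimal sketch of the sales-forecasting idea: fit a regression on
# historical features and predict a future period. Numbers are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Features: month (1-12), average temperature, promotion running (0/1)
X = np.column_stack([rng.integers(1, 13, 200),
                     rng.normal(15, 8, 200),
                     rng.integers(0, 2, 200)])
# Past sales with an assumed seasonal and promotional lift, plus noise
y = 1000 + 30 * X[:, 0] + 5 * X[:, 1] + 400 * X[:, 2] + rng.normal(0, 50, 200)

model = LinearRegression().fit(X, y)
next_week = [[12, 4.0, 1]]   # December, cold weather, promotion on
print(f"forecast: {model.predict(next_week)[0]:.0f} units")
```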
Let’s look at another example of supervised learning. In a recent research study, my colleagues and I were
interested in analyzing social media posts posted by a number of
companies on Facebook. We gathered data on
over 100,000 posts submitted by large
brands on Facebook. We wanted to identify what posts are associated
with the highest engagement. That is, are emotional posts associated with greater engagement, or humorous posts,
or posts that show deals and promotions to
consumers or other posts. Now, it is very
expensive to tag 100,000 posts and label each post
as being humorous or not, emotional or not, as offering a price
discount or not, and so on. We wanted to automate
this process. We use a supervised machine learning
technique to do that. To do that, we first need data, a training dataset that has clearly labeled
inputs and outputs. The inputs are available to us. These are the words that
companies use in their posts. The output is essentially
a label that says whether the post is emotional
or humorous or not. To do that, we took
a sample of 5,000 posts and had human beings
label each of these posts. Every one of these
5,000 posts was labeled by a human being
as being humorous or emotional or as offering
a price discount or being a post that shares a remarkable
fact, and so on. These labels were then used as a training dataset for a supervised machine learning algorithm that learned what words are predictive of whether a post is emotional or humorous or not. Then that algorithm was used
to make predictions for the remaining nearly 100,000 posts that hadn’t been labeled
by a human being. This is essentially the idea of supervised machine learning, which is you need a training
dataset and you learn from that and you apply
that to future data. What we found in our study was that our machine
learning algorithm did well and often had accuracy
of over 90, 95 percent, and sometimes even
greater than 99 percent, in essentially being
able to predict whether a post is humorous or whether a
post is emotional or not. In any business application, if you have a good, high-quality training dataset, one can apply these
techniques in order to make predictions
about the future. The key is collecting
high-quality data, and that is the most
important activity in supervised machine learning. There are a number of very good, high-quality, off-the-shelf algorithms that can be applied to make predictions if you’ve got a high-quality training dataset for machine learning.
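A minimal sketch of that workflow in Python might look like the following; the posts, the labels, and the choice of a bag-of-words model with logistic regression are illustrative assumptions, not the actual method or data from the study.

```python
# Train on a small hand-labeled sample of posts, then label the rest
# automatically, mirroring the label-5,000-predict-100,000 workflow.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labeled_posts = [
    "huge discount this weekend only",      # deal post
    "win a free phone with this coupon",    # deal post
    "we love our customers, thank you",     # not a deal
    "wishing everyone a wonderful week",    # not a deal
]
labels = [1, 1, 0, 0]   # 1 = offers a deal or promotion (assumed labels)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(labeled_posts)
model = LogisticRegression().fit(X_train, labels)

# Apply the trained model to posts no human has labeled
unlabeled = ["weekend discount on all phones", "thank you for a wonderful year"]
print(model.predict(vectorizer.transform(unlabeled)))  # likely [1 0]
```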
The next set of machine learning techniques are unsupervised learning techniques. Unsupervised learning
techniques also take in data, but they don’t have
clearly labeled output. For example,
clustering algorithms that we discussed previously. They tend to cluster our
data into different groups, but they are not told in advance what the ideal clustering looks like, meaning there’s no
labeled output for them. Similarly, another example
is anomaly detection. Anomaly detection algorithms
look at a bunch of data and identify data points that look dissimilar to most
of the other data. Here again, there’s a lot of input data, but there’s no clearly labeled output.
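Here is a tiny sketch of the anomaly detection idea in Python: flag points that sit far from the rest of the data using z-scores. The data and the cutoff are invented for the example; real systems use more sophisticated methods.

```python
# Flag data points that are many standard deviations from the mean.
import numpy as np

rng = np.random.default_rng(0)
normal_data = rng.normal(100, 10, 500)              # ordinary observations
data = np.concatenate([normal_data, [190.0, 5.0]])  # two planted anomalies

z = (data - data.mean()) / data.std()   # how unusual is each point?
print(data[np.abs(z) > 4])              # prints roughly [190.  5.]
```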
Another example is Latent Dirichlet Allocation, or LDA, which is a commonly used
technique for topic modeling, meaning identifying what topics a certain document might cover. Typically, with LDA, you have an input dataset which consists of a large
set of documents. The idea behind LDA is that each document likely covers
a small set of topics, and each topic itself tends to use the same set
of words quite frequently. For example, we might
take a large dataset of news stories published in all of the major newspapers and
online news media outlets, and feed that as an input
to an LDA algorithm. An LDA is trying to identify the topics that these
documents cover, but it’s not given
clearly labeled outputs, meaning that the algorithm
is not told that here’s a document on
politics and here’s a document on sports and so on. LDA, as I said, assumes that each document covers very few topics, and each topic has a few words that it uses frequently. When it takes a training dataset, or an input dataset rather, LDA might identify that a certain topic tends to use certain words
quite frequently. For example, it might
say that here’s a topic that tends to
use the word Obama, the word Trump, the word speech, and a few other such
words quite frequently. But it does not tend
to use words like pizza or baseball as frequently. This clearly we can infer
is the topic of politics, and that’s something
that the algorithm identifies on its own. Now, given any document, LDA then looks at the
kinds of words that are used in this document and identifies which
topics it covers. Given a document, LDA might
say that a topic covers sports or a topic covers
politics and so on. Once LDA has been trained
using a large dataset, it can now be applied
to any new document, and it can
automatically classify these documents and identify
the topics in there. In this example,
you see a passage that LDA might analyze and it looks at certain words that
are used in this document. With each of these
words it identifies certain topics that these
words are related to. For example, arts, or education, or children. Then it identifies a set of topics that this
document covers.
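Below is a minimal sketch of topic modeling in Python using scikit-learn's LDA implementation on a four-document toy corpus; the documents and the choice of two topics are assumptions for illustration.

```python
# Discover topics in unlabeled documents with LDA; with such a tiny
# corpus the split is only suggestive.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the senator gave a speech on the election",
        "the team won the baseball game last night",
        "voters heard the speech before the election",
        "the pitcher dominated the baseball game"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the most heavily weighted words per discovered topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-3:]]
    print(f"Topic {i}: {top}")
```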
Now, in addition to unsupervised learning, we also have the idea of reinforcement learning. Reinforcement
learning usually does not take in large
training datasets. Rather, the algorithm learns by testing or trying various
actions or strategies, and observing what happens and using those observations
to learn something. This is a very
powerful method and has been used in a number of
robotics-based applications. It is also at the heart of a
software created by Google, which was called AlphaZero, which was an advanced version of Google’s Go playing
software, AlphaGo. AlphaGo had used
training dataset, which is based on past Go games. AlphaZero had no
training dataset. Instead, it learned the
game of Go by playing Go against itself and once it played millions of
games against itself, that was in fact the
training dataset that it used to develop the best
strategies for this game. Of course, in many settings, experimentation
isn’t always free, and so you have to balance
the cost of experimentation against exploiting the
knowledge that we already have. Let’s explore that through a reinforcement
learning algorithm known as multi-armed bandit. To illustrate how
bandit algorithms work, let’s consider setting
where you have two different ad
copies that we have designed and that we would like to try with our customers. We do not know which ad
copy is more effective in engaging customers and attracting them to
click on the ad. We would like to ideally figure out which ad is the
better ad to use. Now, one way to figure this out is to do what is
known as A/B testing. That is, we might show ad A to half the users and
ad B to half the users. We might do this for some period of time,
let’s say a day. Then we observe which ad has
the higher click-through rate, and we might use that ad from then on. Now, in this graph that
you see on this slide, we have two ads, ad A and ad B. Ad A has a
click-through rate of five percent and ad B has a click-through
rate of 10 percent. But we do not know
this in advance. What we might end up
doing is show ad A to some users and show
ad B to some users. If we’ve shown these ads in a randomized version to a large number of
users, over time, we learn that ad A has five
percent click-through rate, ad B has 10 percent
click-through rate. Then we can use ad B
from that point onwards. But there is a cost
of this learning, because some people were shown ad A and some people
were shown ad B. During this learning step, the average
click-through rate that our ads experienced was
seven and a half percent, which is lower
than we would have obtained if we had chosen
the better-performing ad. Now, a bandit algorithm can do better, and it can improve performance. The way it does this is that it starts off initially like any A/B testing algorithm, meaning it shows ad A and
ad B equal number of times. But it starts to observe what is happening
and is learning. For example, it
starts to observe that ad B is doing
better than ad A. As it learns this, it starts to show ad B
more frequently than ad A. It still will show ad A a few times, so it still allows itself to learn and correct course in case ad A actually performs better. But over time it starts to
weigh ad B more and more, and as a result, if you observe at the
end of a day or in this example at the
end of 1,000 sessions, the bandit-algorithm-based allocation strategy ended up having a click-through rate that was much higher than the seven and a half percent we obtained through A/B testing. It was not quite equal to the
10 percent that ad B has, but it’s close enough
because what it’s able to do is it’s able to
experiment and learn, and exploit that knowledge to
also improve the outcomes. In short, a reinforcement
learning algorithm is essentially an algorithm
that takes actions, observe what happens, and then improves its
performance over time.
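To make the ad example concrete, here is a small Python simulation using an epsilon-greedy allocation rule as a stand-in for the bandit strategy described; the lecture does not specify the exact algorithm, and the 5% and 10% click-through rates are taken from the example.

```python
# Simulate the two-ad example (true CTRs 5% and 10%) with a simple
# epsilon-greedy bandit: mostly exploit the best ad so far, but keep
# exploring a small fraction of the time.
import random

random.seed(0)
ctr = [0.05, 0.10]       # true click-through rates of ad A and ad B
clicks = [0, 0]
shows = [0, 0]
epsilon = 0.1            # fraction of traffic reserved for exploration

for _ in range(1000):    # 1,000 sessions, as in the example
    if random.random() < epsilon or 0 in shows:
        ad = random.randrange(2)                                 # explore
    else:
        ad = max(range(2), key=lambda a: clicks[a] / shows[a])   # exploit
    shows[ad] += 1
    clicks[ad] += random.random() < ctr[ad]

overall_ctr = sum(clicks) / sum(shows)
print(f"overall CTR: {overall_ctr:.3f}")  # typically between 7.5% and 10%
```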

Video: Reinforcement Learning

The text discusses reinforcement learning, specifically multi-armed bandit algorithms, as a powerful tool for making decisions in situations where continuous data is available and can be learned from to improve decisions. Examples of such situations include:

  • Personalizing a news media website to users
  • Determining which news articles to feature on the homepage
  • Personalizing a product page on an e-commerce website (e.g., Nanophone) to a consumer, including deciding which images to show, which product features to emphasize, and which discounts to offer

The key challenge in these situations is balancing exploration (gathering more information about the decision environment) and exploitation (making the best decision based on current information). Multi-armed bandit algorithms can help balance this trade-off.

The text also explains two algorithms for multi-armed bandit problems:

  • Epsilon-first: experiment early and then exploit what has been learned
  • Thompson sampling: initially allocate traffic equally to all choices, then adjust based on the results

Reinforcement learning has applications in gaming and online personalization, but is not as widely used as supervised machine learning.

Let’s next look at
reinforcement learning and in particular multi-armed bandit algorithms more closely. These algorithms are a powerful tool when
you have continuous data coming in and we can learn from the data
to improve decisions. For example, consider a media website, like a news media website that would like
to personalize the website to its users. Or determine for example which
of thousands of different news articles to profile at
the top of its homepage. Or consider an e-commerce retailer. Let’s call it Nanophone, which is a retailer of, let’s say, mobile phones. When a consumer logs into the website, Nanophone needs to decide how to personalize
the product page to the consumer. They might have to decide which of 10
different images of the phone to show to the consumer. They might have to decide which product
features to emphasize to the consumer. For example, should they focus on
the battery life for the phone or should they focus on the sleek design or
some other product attribute. They might want to decide on which of
several different discounts to offer to this consumer: 0% discount,
5% discount, 10% discount. They might have multiple
calls to actions and they might have to choose
which call to action to use. So the action space, or the set of choices available to the marketer in this context, is really very large. And the goal is to decide which actions
to choose in order to maximize revenues. At the heart of this problem is
the question of how much do we explore and how much do we exploit? And what we mean by that is exploration
is all about gathering more information about the decision environment. For example, asking the question, what might happen if I choose not to
emphasize the battery life as much and instead choose to focus on
the sleek design of the phone? In contrast, exploitation is about making the best
decision given the current information. Maybe based on the current information,
we believe the marketing message that most attracts the consumer is a
message that emphasizes the battery power. So should we go with that or
should we try something new? Now, we routinely use
the ideas of exploration and exploitation in our everyday lives. For example,
suppose you’re going to a restaurant, do you go to a completely new restaurant
which is the equivalent of exploration? Or do you go to your favorite restaurant
which you’ve been to many times and you know is tried and tested? That is exploitation. And there are times at which
you might choose to go to your favorite restaurant
that is choose to exploit. And there are times at which you might
say, let’s try something new and learn about the restaurant,
even if it risks the possibility that we might not enjoy the food or
the experience. And that’s exploration. The question at the heart of these kinds
of decision problems is how do we balance exploration versus exploitation? How do we decide when to try
a completely new marketing message or a completely new say, web page to
the consumer versus when do we use something that has worked
reasonably well in the past. And this trade off is what is
really handled by algorithms like multi-armed bandit algorithms,
which as I mentioned earlier is sort of the classical
reinforcement learning approach. Now, the multi-armed bandit problem is a problem in which a fixed or finite set of resources must be
allocated among multiple choices. So, for example, imagine a gambler in a casino faced with a row of slot machines, and the gambler must
decide which slot machines to pull. And the gambler only has a finite
amount of time in the casino and can therefore get only 100 or
200 pulls in different slot machines. So they must decide at any given time
whether to try a completely new slot machine and see if it’s associated with a much higher
probability of getting high rewards. Or should the gambler stick with
a slot machine that is already producing reasonable returns. Now, there are many algorithms that can
be used to balance this exploration and exploitation and
indeed there are many algorithms for multi armed bandit problems. For example, a strategy epsilon
first is essentially a heuristic in which we tend to experiment early,
that is explore a lot early. And then once we have learned a little
bit, then we start exploiting. So, in the context of personalizing the website for Nanophone, what we might do is, during the first few
weeks we might choose to explore and try many different marketing messages,
many different images and so on. And once we’ve learned what we wanted to
learn, then we choose to exploit, that is, allocate 100% of the traffic to
the best performing variant that we discovered in the first
few weeks of exploration. Another algorithm available
is Thompson sampling. What Thompson sampling might do for
the problem that nanophone faces is that it might initially allocate the traffic,
the web traffic coming into the website equally to all the different choices
that the company is considering. Meaning the choices like,
should we emphasize messages and visuals that show the battery? Should we use visuals and
messages that talk about the sleek design? Should we instead use visuals and messages that talk about the app store of the phone, and so on? And Thompson sampling will initially
allocate traffic to each of these choices with equal probability. But as more and more data comes in,
Thompson sampling will choose the alternatives that
are producing higher or better results. It will choose them with
higher probability. So, if for example the message and
visuals that emphasize the app store are the ones slowly but steadily producing better results, then the probability with which that choice is chosen will keep going up. Qualitatively speaking, that’s what the algorithm does. Obviously, the details of how the algorithm works are probably not of the most interest, given that our focus is to talk about the business applications of AI. But hopefully you get a sense of the intuition behind these approaches.
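Here is a small sketch of Thompson sampling in Python for the Nanophone example; the three message options and their true click rates are invented, and the Beta-distribution bookkeeping shown is the standard textbook version rather than anything specific to the lecture.

```python
# Thompson sampling over three marketing messages: keep a Beta belief
# about each option's unknown click rate, sample from each belief, and
# show the option whose sampled rate is highest.
import random

random.seed(1)
true_rates = [0.04, 0.06, 0.09]   # e.g. battery, design, app store (invented)
successes = [1, 1, 1]             # Beta(1, 1) uniform priors
failures = [1, 1, 1]

for _ in range(5000):
    # Sample a plausible rate for each option, pick the highest sample
    samples = [random.betavariate(successes[i], failures[i]) for i in range(3)]
    choice = samples.index(max(samples))
    if random.random() < true_rates[choice]:
        successes[choice] += 1
    else:
        failures[choice] += 1

shown = [successes[i] + failures[i] - 2 for i in range(3)]
print("times each option was shown:", shown)  # option 2 should dominate
```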
In summary, while we assume that machine learning is based on having access to large data sets, reinforcement learning offers an alternative that relies less on training data and more on dynamic experimentation
to learn which strategies are doing better and
to use those more and more. Reinforcement learning has found
many applications in gaming and in online personalization. That said, today it is not as widely used
as other machine learning approaches, such as supervised machine learning. Given how pervasive supervised
machine learning is, especially in business settings, we will now deep dive into the world
of supervised machine learning methods.

Video: A Detailed View of Machine Learning

What is Machine Learning?

  • Machine learning is a type of AI that involves taking input variables and predicting an outcome variable.
  • Supervised machine learning is the most common type of machine learning, where the goal is to learn from labeled data to make predictions.

Example of Machine Learning

  • Predicting whether a user will purchase a product based on their website behavior, demographics, and device information.

Key Concepts

  • Input variables (X): The data used to make predictions, also known as features, predictors, or covariates.
  • Outcome variable (Y): The variable being predicted, also known as the output or target variable.
  • Function f: The relationship between the input variables and the outcome variable, which machine learning aims to approximate.

Factors that Drive Prediction Accuracy

  • Quantity of data: Having more observations (rows) helps increase accuracy.
  • Quality of data: Having more features (columns) and relevant information helps increase accuracy.
  • Relevance of information: Having relevant data is more important than having a large quantity of data.
  • Complexity of the model: More complex models can capture more complex relationships, but may also lead to overfitting.
  • Feature engineering: The ability to create new features or transform existing data to improve predictions.

Next Steps

  • The next lecture will cover specific machine learning algorithms to provide a better understanding of how they work.

Hello. In this lecture, we’re going to go into the
details of machine learning. In particular, I’ll try and provide a high-level
view of what is machine learning and what drives the accuracy of machine
learning models. For the purposes of
this discussion, I’ll tend to focus only on supervised
learning techniques. There’s a good reason for this. If we look at the
practice of AI, I would say that
almost 90 percent or maybe even higher
than 90 percent of practical business uses of AI tends to be machine learning. If you look within
machine learning, almost 90 percent of
machine learning in practice tends to be
supervised machine learning. For this discussion of machine learning at a very high level, I’ll tend to focus
our attention on supervised machine
learning algorithms. As I’d mentioned
earlier, at its core, supervised machine learning
is all about taking a set of input variables and predicting some
outcome variable. Now, we do this all the
time in our real life. For example, if you observe
dark clouds and strong winds, we might predict that
maybe it’s going to rain. Or we might look at what clothes somebody’s wearing or
better yet how they’re interacting with us
and we might make some inferences or
predictions about whether we’re likely to be
good friends with them. Or at the workplace, we might look at a person’s
educational background. We might look at
their job experience, at their skills, and predict whether they’ll be
successful at the job. These are all typical
prediction problems that we are solving on
a day-to-day basis. Now there are many
business applications of these kinds of predictions. Now, for example, in
a business setting, we are trying to predict whether somebody will buy a product. We’re trying to predict whether somebody will click our ad. All of these predictions
can be made using supervised learning if we
have good training data. Let’s consider an example. Suppose we have data about
users coming to our website. We might know how many pages they have viewed on our
website in the past, we might know their zip code
based on their IP address. We might know what device they’re accessing our page from, and we might know the operating
system of that device, and ultimately we’re
trying to predict whether this person will purchase
a product or not. Now that’s a typical prediction
problem one might have. Now, for the data we
have about our users, meaning the input data that
we use to make predictions, one refers to these
input data as the predictors of our model
or the features in our model, or sometimes just the variables or the
covariates of the model. There are many different names, but ultimately we can think of all these variables as
the inputs to our model. Now, given these inputs, we’re also trying to
predict something, that’s the output of our model
or the outcome variable. We might describe the input
variables using the letter X, and we might describe the outcome variable
using the letter Y. Now the prediction
problem that we’re facing is we’re
trying to figure out, given the input X, we’re trying to predict Y. That is, we are trying to
figure out some function f that takes X as an
input and predicts Y. The entire task of supervised machine learning
comes down to coming up with a highly accurate approximation
of this function f so that we can predict
Y as accurately as possible given the inputs X. Now, the notion of accuracy is built into what
I just described. For any prediction problem, we think of accuracy as essentially a measure
of how often are the predictions true or how close are the
predictions to reality. For example, if we’re trying to predict whether a person
purchases or not, and we make 100 predictions of which 93 are correct, we might say that the prediction accuracy of our model is 93 percent.
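Putting the pieces together, here is a minimal Python sketch of learning f from labeled (X, Y) data and measuring accuracy on held-out data; the features and the synthetic relationship are assumptions for illustration.

```python
# Learn a function f: X -> Y from labeled data and measure accuracy
# on a held-out test set. Features and data are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# X: pages viewed, past purchases, session minutes (assumed features)
X = rng.random((2000, 3))
# Y: purchase if engagement is high enough, flipped 10% of the time as noise
y = ((X[:, 0] + X[:, 1] > 1.0) ^ (rng.random(2000) < 0.1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print(f"accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```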
We want this prediction accuracy to be as high as possible, and so a natural question to ask is: what drives the prediction
accuracy of a model? One factor that drives
the accuracy of a machine learning model
is the quantity of data, meaning the number of
distinct observations we have as an input
into the model. For example, if we have data on only 100 customers who’ve been to our website in the past, it would be very hard for us to make accurate predictions about future customers
visiting our website based on the observations
of just these 100 users. On the other hand, if we had
data for a million users, then we have a lot more
data and so clearly having more observations helps increase the accuracy of our model. Another driver of
prediction accuracy is how much do we know
about each observation? Now in the example
I just described, for each consumer, we only knew the number of pages
they viewed in the past. We knew the operating
system they use, and we also happen to
know their location. But this amount of
data might not be sufficient for us to actually
make the predictions. On the other hand, if we had much more data about each user. For example, we knew
that user’s interests, we also knew that
person’s income level, we also happen to know whether
they have made purchases from our website in the past and many more such observations, then the prediction accuracy of our model might go
up significantly. In other words, the two
factors that seem to drive the accuracy of the
model are the number of rows, which we can think about
the number of data points, and the number of columns, which we can think of as
the number of features or the number of X variables
available within each row. Together, they tend to drive the accuracy of machine
learning models in very significant ways. But they’re by no means
the only two factors. There’s a number
of other factors. For example, how relevant is the information
you have available? If we were trying to predict whether it’s going
to rain today, knowing how many people are
carrying umbrellas today is more useful than just knowing the color of the
clothes worn by people. So clearly having more
relevant data matters. Similarly, the complexity
of the model matters. If we restrict ourselves
to use very simple models, they might not be
able to capture very complex relationships that are out there in
the environment. Some of the more modern
machine learning methods like deep learning, which I will describe
in a later lecture, allow us to have more flexible and more
complex relationships between the input variables and
the outcome variables, and this helps increase the prediction
accuracy of models. Another factor is what is referred to as
feature engineering. This is essentially
the ability of an analyst to use their
domain knowledge to create new features or new input variables that are
predictive of the outcome. This comes down to having
deep domain knowledge and identifying what new data might we add to our
dataset or how might we transform our dataset so that we can better
make predictions. In short, there are many
different factors that drive the success of
machine learning models. Certainly one has to think
about many of these factors. It always starts with having high-quality and
high-volume data with lots of rows
and lots of columns. In the next lecture, we will talk about some specific machine
learning algorithms to give you a better
intuition of what exactly these machine
learning algorithms do.

Module 1 Quiz

What evidence is there that AI might be the next general-purpose technology?
Experts predict artificial general intelligence is only 50 years away.
It will be a highly impactful technology in the future, though it is not affecting us currently.
It is very difficult to store big data without a solid understanding of artificial intelligence.
There is widespread demand for AI and AI research skills across industries.

Which statement about big data is least accurate?
Useful to decision-makers in a wide number of industries
Can be used to answer new types of questions
Thought of as mostly a bigger version of previous forms of data analysis
Difficult to implement, requiring new organizational skills

One of the challenges of working with big data is that:
You must have a hypothesis in mind before analyzing the data.
Data generation has increased in recent years but computing capacity has remained largely stagnant.
Working with it involves a broad skillset and a wide range of tools.
Machine learning techniques are not applicable to big data.

What is the primary value of using data warehouses?
They provide a single point of access for data/analytics functions without affecting operations.
They allow for unstructured data storage.
They allow you to integrate with Amazon/Google.
They have bigger servers which allow for bigger data.

The main value of using MapReduce is:
Its ability to answer questions about future customers.
It integrates well with Microsoft Excel.
It allows for parallel computing which speeds up your query times.
It reduces any computation to a grid search algorithm.

Which of the following statements is/are true (select all that apply):
Predictive analytics is only useful for customer retention.
Fraud detection and recommendation systems are examples of predictive analytics.
Association rule mining looks for common co-occurrences in your data.
Clustering is the process of distributing your server load to multiple computers

Data mining is:
Another name for regression
A type of data warehouse
The process of discovering new data to include in your systems
A term encompassing tools for discovering patterns in large datasets

Which of the following statements is false?
Big data is a valuable complement to predictive analytics.
Clustering can be used for data-driven customer segmentation.
Data mining and big data are technical tools mostly useful for software developers
Amazon Redshift, Snowflake, and Google BigQuery are common data warehouse tools.

Recommender systems, such as those used by Amazon, are best characterized as being examples of:
MapReduce
Data warehouses
Database management systems
Predictive analytics

Artificial Narrow Intelligence, also known as weak AI:
Involves transferring knowledge from experts to a knowledge base
Can rapidly improve itself and do all things a human can do at a significant increase in speed and competency
Can do all intelligent things a human can do just as quickly and easily, although it cannot rapidly improve itself
Is AI that is very good at one specific task