You’ll start this course by exploring data modeling, common schemas, and database elements. You’ll consider how business needs determine the kinds of database systems that BI professionals implement. Then, you’ll discover pipelines and ETL processes, which are tools that move data and ensure that it’s accessible and useful.
Learning Objectives
- Identify and define key database models and schemas.
- Assess which database design pattern and schema is appropriate for different data.
- Discuss data model alternatives that are optimal, performant, and adherent to reporting requirements, taking current data size and growth into account.
- Define ETL and explain what it means.
- Identify key information from stakeholders necessary to create a data pipeline.
- Describe different types of pipelines.
- Describe the key stages of a data pipeline.
- Understand what a data pipeline is, its objectives, and how it works.
- Get started with data modeling, schemas, and databases
- Video: Introduction to Course 2
- Video: Ed: Overcome imposter syndrome
- Reading: Course 2 overview
- Video: Welcome to module 1
- Video: Data modeling, design patterns, and schemas
- Video: Get the facts with dimensional models
- Video: Dimensional models with star and snowflake schemas
- Reading: Design efficient database systems with schemas
- Video: Different data types, different databases
- Reading: Database comparison checklist
- Practice Quiz: Test your knowledge: Data modeling, schemas, and databases
- Choose the right database
- How data moves
- Data-processing with Dataflow
- Organize data in BigQuery
- Video: Gather information from stakeholders
- Reading: Merge data from multiple sources with BigQuery
- Practice Quiz: Activity: Set up a sandbox and query a public dataset in BigQuery
- Reading: Unify data with target tables
- Practice Quiz: Activity: Create a target table in BigQuery
- Reading: Activity Exemplar: Create a target table in BigQuery
- Reading: Case study: Wayfair – Working with stakeholders to create a pipeline
- Review: Data models and pipelines
Get started with data modeling, schemas, and databases
Video: Introduction to Course 2
This course will teach you how to build tools to provide stakeholders with ongoing insights by automating the process of pulling data from different sources, monitoring it, and providing data-driven insights. You will learn about design patterns and database schemas, data pipelines and ETL processes, database optimization, quality testing, and applying your skills to a realistic business scenario. By the end of this course, you will be able to use data modeling and pipelines to make other people’s jobs easier by automating, simplifying, and enhancing their processes.
As a BI professional, you aren’t just answering
your team’s questions, you’re empowering them with the data to answer
their own questions. By pinpointing the
answers they require, you can build tools that
enable them to access and use the data they
need when they need it. Hey, there. Welcome
to this course. If you’ve already completed
the previous one, you might remember me, but if you’re just
joining us, I’m Ed. I’m a product manager
here at Google. I’m really excited
to help you get started with data
models and extract, transform, and load,
or ETL pipelines. As you’ve been learning, BI professionals
are responsible for analyzing data to generate meaningful insights
and solve problems, answer questions, find patterns, and inform business decisions. A large part of this
is building tools to provide stakeholders
with ongoing insights. This course is going to focus
on those tools and how to automate them in order to pull data from
different sources, monitor it, and provide
data-driven insights. First, you’ll learn about design patterns and
database schemas, including common structures
that BI professionals use. You’ll also be introduced to data pipelines and
ETL processes. You’ve learned that
ETL stands for extract, transform, and load. This refers to the process of gathering data from
source systems, converting it into
a useful format, and bringing it into
a data warehouse or other unified
destination system. This will be an
important part of your job as a BI professional. You’ll also develop strategies for gathering information
from stakeholders in order to help you develop more useful tools and
processes for your team. After that, you’ll focus on database
optimization to reduce response time or
the time it takes for a database to
complete a user request. This will include exploring
different types of databases and the five factors
of database performance, workload, throughput, resources, optimization,
and contention. Finally, you’ll learn
about the importance of quality testing
your ETL processes, validating your database schema, and verifying business rules. Once you’ve finished
this course, you’ll apply your skills to a realistic business scenario, which is a great
way to demonstrate your BI knowledge to
potential employers. As you’ve been learning,
a large part of BI is making other people’s jobs
easier by automating, simplifying, and enhancing
their processes. For example, in one of
my projects I helped the central finance team aggregate years’ worth
of global sales. This allowed my team to identify the underlying
drivers that affected trends in prices and
quantities sold. They were then able
to clearly report these findings to
key stakeholders. I love solving
problems and making my team’s lives a
little easier, which is one of the reasons why I’m
so excited to teach you more about data modeling and
pipelines in this course. All right, let’s get started.
Video: Ed: Overcome imposter syndrome
Imposter syndrome is a belief that you are not where you need to be, in terms of your skill, your perspectives, background, or your experience. It is common among product managers, who often work with people who are very skilled in their areas of expertise.
One way to overcome imposter syndrome is to focus on your unique perspective. Everyone has a unique combination of skills, experience, and interests that they can bring to the table.
Another way to overcome imposter syndrome is to be vulnerable and transparent about how you are feeling. Talk to people you trust about your challenges and ask for help when you need it.
It is also important to focus on your strengths and give yourself credit for the things you do well. You are more likely to be successful by leaning into your strengths than by trying to hide your weaknesses.
Hi. I’m Ed. I’m a Product Manager at Google. As a product manager, I define the vision
for a product and make sure that it
aligns with what users actually need that
product to do for them. For me, imposter
syndrome is a belief that you are not
where you need to be, in terms of your skill, your perspectives, background,
or your experience. I’ve definitely experienced
imposter syndrome. I work with a lot of people who are very skilled in
their areas of expertise. Not everyone can have every single skill across the board. You end up thinking, oh, maybe I should be
able to program like her, or maybe I should be as good
a data scientist as him, and maybe I should have
this level of perspective as it seems like everyone
around me does as well. That’s not necessarily true. I really think it’s
important to focus on the unique perspective
that you provide, because everyone’s perspective and expertise and interest, and the way in which they’re
going to apply all of those, they’re all going to differ. There are unique
combinations of things that you provide that other
people cannot provide, simply by virtue of the
fact that you are you. I found that the most useful
technique in overcoming imposter syndrome
is being vulnerable and transparent that
you feel that way. Find people that you trust, find people that
you can speak with and tell them how
you’re feeling. Tell them why you’re
feeling that way. Feeling a certain way
doesn’t have to be an indication of who you are
or what you’re capable of. It simply is. Being able to be
vulnerable and say, hey, I don’t understand this, or I would like a little
bit extra information, that can be helpful, not only for you, but also
for people around you. We tend to focus
on the negatives. We tend to focus on
the challenges or the constructive aspects that we might see that we
need to improve on, while not giving
ourselves enough credit for the things that we do well, the things that are strengths that we should lean more into. You’re going to be
more successful by understanding and
really leaning into your strengths
than simply trying to hide or run away
from your failures.
Reading: Course 2 overview
Video: Welcome to module 1
This course will teach you about data modeling, schemas, and pipeline processes, which are essential tools for BI professionals. You will learn about the foundations of data modeling, common schemas, and key database elements. You will also learn how business needs determine the kinds of database systems that BI professionals implement. Finally, you will learn about pipelines and ETL processes, which are the tools that move data throughout the system and make sure it’s accessible and useful. By the end of this course, you will have added many more important tools to your BI toolbox.
Welcome to the first section of this
course. You’re going to learn about how BI professionals use data models to
help them build database systems, how schemas help professionals understand
and organize those systems and how pipeline processes move data from
one part of the system to another. We’ll start by exploring data modeling
foundations as well as common schemas and key database elements. We’ll also consider
how business needs determine the kinds of database systems that
a BI professional might implement. We’ll then shift to pipelines and ETL
processes, which are the tools that move data throughout the system and
make sure it’s accessible and useful. By the time you’re done, you’ll have added
many more important tools to your BI toolbox. Let’s get started.
Video: Data modeling, design patterns, and schemas
This video introduces data modeling, design patterns, and schemas.
- Data modeling is a way of organizing data elements and how they relate to one another.
- Design patterns are solutions that use relevant measures and facts to create a model to support business needs.
- Schemas are a way of describing how data is organized.
BI professionals use data modeling to create destination database models. These models organize the systems, tools, and storage accordingly, including designing how the data is organized and stored.
Design patterns and schemas are used in BI to create systems that are consistent and efficient. BI professionals use these tools to create data models that meet the specific needs of their businesses.
Data Modeling, Design Patterns, and Schemas
Data modeling is the process of organizing data elements and how they relate to one another. It is a way to create a conceptual model of data that can be used to design and implement databases.
Design patterns are reusable solutions to common data modeling problems. They provide a template for creating data models that are efficient and effective.
Schemas are a way of describing how data is organized in a database. They define the structure of the database, including the tables, columns, and relationships between them.
Data Modeling
Data modeling can be used to model data for a variety of purposes, including:
- Designing databases
- Developing data warehouses
- Creating data marts
- Designing data integration solutions
- Documenting data requirements
Data modeling is a complex process, but it is essential for creating efficient and effective databases.
Design Patterns
There are many different design patterns that can be used for data modeling. Some of the most common design patterns include:
- Entity-relationship (ER) diagrams: ER diagrams are used to model the entities and relationships between entities in a database.
- Dimensional modeling: Dimensional modeling is used to model data for analytical purposes. It is often used to create data warehouses and data marts.
- Normalized models: Normalized models are designed to minimize data redundancy and improve data integrity.
- Star schemas: Star schemas are a type of dimensional model that is often used for data warehousing.
- Snowflake schemas: Snowflake schemas are a more complex type of dimensional model that is often used for data warehousing.
Schemas
Schemas are used to describe the structure of a database. They define the tables, columns, and relationships between them. Schemas can be created using a variety of different tools, such as SQL or graphical database design tools.
Schemas are important for a number of reasons. They help to ensure that data is organized in a consistent way. They also make it easier to query and analyze data.
Using Design Patterns and Schemas in BI
Design patterns and schemas are used in BI to create systems that are consistent and efficient. BI professionals use these tools to create data models that meet the specific needs of their businesses.
For example, a BI professional might use a dimensional modeling design pattern to create a data warehouse for a retail company. The data warehouse would be used to store data about products, customers, and sales. The BI professional would also create a schema for the data warehouse, which would define the tables, columns, and relationships between them.
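To make this concrete, here is a minimal sketch of what the schema for that retail data warehouse might look like, written in generic ANSI SQL. All table and column names are assumptions for illustration, not definitions from the course.

```sql
-- Hypothetical star schema for the retail example: one fact table plus
-- dimension tables for products, customers, and dates.
CREATE TABLE dim_product (
  product_id   INT PRIMARY KEY,
  product_name VARCHAR(100),
  category     VARCHAR(50)
);

CREATE TABLE dim_customer (
  customer_id   INT PRIMARY KEY,
  customer_name VARCHAR(100),
  region        VARCHAR(50)
);

CREATE TABLE dim_date (
  date_id   INT PRIMARY KEY,
  full_date DATE,
  month     INT,
  year      INT
);

-- The fact table holds the measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
  sale_id       INT PRIMARY KEY,
  product_id    INT REFERENCES dim_product (product_id),
  customer_id   INT REFERENCES dim_customer (customer_id),
  date_id       INT REFERENCES dim_date (date_id),
  quantity_sold INT,
  sales_revenue DECIMAL(12, 2)
);
```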
The BI professional could then use the data warehouse and schema to query and analyze data to answer questions such as:
- What are the most popular products?
- Which customers are spending the most money?
- What are the trends in sales?
Conclusion
Data modeling, design patterns, and schemas are essential tools for BI professionals. By understanding these concepts, BI professionals can create efficient and effective data models that meet the specific needs of their businesses.
Fill in the blank: In order to create an effective model, a design pattern uses _____ that are important to the business. Select all that apply.
measures
In order to create an effective model, a design pattern uses measures and facts that are important to the business.
facts
In order to create an effective model, a design pattern uses measures and facts that are important to the business.
In this video, we’re going
to explore data modeling, design patterns, and schemas. If you’ve been working
with databases or if you’re coming from the Google
Data Analytics certificate, you may be familiar with data modeling as a way to
think about organizing data. Maybe you’re even already using schemas to understand how
databases are designed. As you’ve learned, a database is a collection of data stored
in a computer system. In order to make
databases useful, the data has to be organized. This includes both
source systems from which data is ingested and moved and the destination database where it
will be acted upon. These source systems
could include data lakes, which are database systems
that store large amounts of raw data in its original
format until it’s needed. Another type of source system is an Online Transaction
Processing or OLTP database. An OLTP database is
one that has been optimized for data processing
instead of analysis. One type of destination
system is a data mart, which is a subject-oriented
database that can be a subset of a
larger data warehouse. Another possibility is using an Online Analytical
Processing or OLAP database. This is a tool that has been optimized for analysis
in addition to processing and can analyze
data from multiple databases. You will learn more about
these things later. But for now, just understand
that a big part of a BI professional’s responsibility is to create the
destination database model. Then they will organize
the systems, tools and storage accordingly, including designing how the
data is organized and stored. These systems all play a part in the tools you’ll be
building later on. They’re important foundations
for key BI processes. When it comes to organization, you likely know that
there are two types of data: unstructured
and structured. Unstructured data
is not organized in any easily
identifiable manner. Structured data has been
organized in a certain format, such as rows and columns. If you’d like to revisit
different data types, take a moment to review this information from the
Data Analytics certificate. Now, it can be tricky to
understand structure. This is where data
modeling comes in. As you learned previously, a data model is a
tool for organizing data elements and how they
relate to one another. These are conceptual
models that help keep data consistent
across the system. This means they
give us an idea of how the data is
organized in theory. Think back to Furnese’s perfect train of
business intelligence. A data model is like a
map of that train system. It helps you navigate
the database by giving you directions
through the system. Data modeling is a process
of creating these tools. In order to create
the data model, BI professionals will often use what is referred to
as a design pattern. A design pattern is a solution
that uses relevant measures and facts to create a model
to support business needs. Think of it like a re-usable
problem-solving template, which may be applied to
many different scenarios. You may be more familiar
with the output of the design pattern,
a database schema. As a refresher, a
schema is a way of describing how something
such as data is organized. You may have encountered schemas before while working
with databases. For example, some
common schemas you might be familiar with
include relational models, star schemas, snowflake
schemas, and NoSQL schemas. These different
schemas enabled us to describe the model being
used to organize the data. If the design pattern
is the template for the data model then the schema is the
summary of that model. Because BI professionals play such an important role in
creating these systems, understanding data modeling is an essential part of the job. Coming up, you’re going to learn more about how design
patterns and schemas are used in BI and get
a chance to practice data modeling
yourself. Bye for now.
Video: Get the facts with dimensional models
This video introduces dimensional modeling, a type of relational modeling technique that is used in business intelligence. Dimensional models are optimized for quickly retrieving data from a data warehouse.
Dimensional models are made up of two types of tables: fact tables and dimension tables. Fact tables contain measurements or metrics related to a particular event. Dimension tables contain attributes of the dimensions of a fact. These tables are joined together using foreign keys to give meaning and context to the facts.
Dimensional modeling is a powerful tool for BI professionals because it allows them to quickly and easily analyze data. By understanding how dimensional modeling works, BI professionals can design database schemas that are efficient and effective.
In the next video, we will look at different types of schemas that can be used with dimensional modeling.
Get the facts with dimensional models in business intelligence
Dimensional modeling is a data modeling technique that is used in business intelligence to create data warehouses and data marts. Dimensional models are optimized for quickly retrieving data for analysis.
Dimensional models are made up of two types of tables: fact tables and dimension tables.
- Fact tables: Fact tables contain measurements or metrics related to a particular event. For example, a fact table for sales might contain columns for date, product, customer, and quantity sold.
- Dimension tables: Dimension tables contain attributes of the dimensions of a fact. For example, a dimension table for customers might contain columns for customer name, address, and phone number.
Fact tables and dimension tables are joined together using foreign keys to give meaning and context to the facts.
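As a hedged illustration of how those joins work, the query below aggregates a hypothetical sales fact table by month and product. The table and column names (fact_sales, dim_product, dim_date) are assumptions, not part of the course materials.

```sql
-- Top-selling products by month: join the fact table to its dimensions,
-- then group by the dimension attributes of interest.
SELECT
  d.year,
  d.month,
  p.product_name,
  SUM(f.quantity_sold) AS total_units
FROM fact_sales AS f
JOIN dim_product AS p ON f.product_id = p.product_id
JOIN dim_date    AS d ON f.date_id    = d.date_id
GROUP BY d.year, d.month, p.product_name
ORDER BY d.year, d.month, total_units DESC;
```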
Dimensional modeling is a powerful tool for BI professionals because it allows them to quickly and easily answer questions about their data. For example, a BI professional could use a dimensional model to answer questions such as:
- What are the top-selling products by month?
- Which customers are spending the most money?
- What are the trends in sales over time?
Dimensional models are used by businesses of all sizes to make better decisions about their operations. For example, a retail company might use a dimensional model to identify which products are selling well and which customers are most profitable. A manufacturing company might use a dimensional model to track production costs and identify areas for improvement.
Here are some of the benefits of using dimensional models in business intelligence:
- Faster data retrieval: Dimensional models are optimized for quickly retrieving data for analysis. This is because dimension tables are typically denormalized, which means they contain some redundant data. This redundancy makes it possible to join fact tables to dimension tables very quickly.
- Easier data analysis: Dimensional models are designed to make it easy to analyze data. The fact table contains the measurements or metrics that you want to analyze, and the dimension tables provide the context for those measurements. This makes it easy to see how different factors are related to each other.
- Flexibility: Dimensional models are flexible enough to be used to analyze a wide variety of data. This makes them a good choice for businesses that need to be able to analyze different types of data over time.
If you are a BI professional, it is important to understand how dimensional models work. By understanding dimensional modeling, you can design database schemas that are efficient and effective for data analysis.
If you’ve been working with databases and SQL, you’re probably already familiar with
relational databases. In this video, you’re going to return to the concept
of relational databases and learn about a specific kind of relational
modeling technique that is used in business intelligence:
dimensional modeling. As a refresher, a relational database contains
a series of tables that can be connected to form relationships. These relationships are established
using primary and foreign keys. Check out this car dealership
database. Branch ID is the primary key in the car
dealerships table, but it is the foreign key in
the product details table. This connects these two tables directly. VIN is the primary key in
the product details table and the foreign key in the repair parts table. Notice how these connections actually
create relationships between all of these tables. Even the car dealerships and repair parts tables are connected
by the product details table. If you took the Google Data
Analytics Certificate, you learned that a primary key is an
identifier in the database that references a column in which each
value is unique. For BI, we’re going to expand this idea. A primary key is an identifier in
a database that references a column or a group of columns in which each row
uniquely identifies each record in the table. In this database, we
have primary keys in each table: Branch ID, VIN, and part ID. A foreign key is a field within
a database table that’s a primary key in another table. The primary keys from each table also
appear as foreign keys in other tables, which builds those connections. Basically, a primary key can be used to impose
constraints on the database that ensure data in a specific column is unique
by specifically identifying a record in a relational database table. Only
one primary key can exist in a table, but a table may have many foreign keys. Okay, now let’s move on
to dimensional models. A dimensional model is a type of
relational model that has been optimized to quickly retrieve data
from a data warehouse. Dimensional models can be broken
down into facts for measurement and dimensions that add attributes for
context. In a dimensional model, a fact is a measurement or metric. For example, a monthly sales number could
be a fact and a dimension is a piece of information that provides more detail and
context regarding that fact. It’s the who, what,
where, when, why and how. So if our monthly sales number is
the fact, then the dimensions could be information about each sale, including
the customer, the store location and what products were sold. Next,
let’s consider attributes. If you earned your Google
Data Analytics certificate, you learned about attributes in tables. An attribute is a characteristic or
quality of data used to label the table
columns. In dimensional models, attributes work kind of the same way. An
attribute is a characteristic or quality that can be used to describe a dimension. So a dimension provides
information about a fact and an attribute provides
information about a dimension. Think about a passport. One dimension on your passport
is your hair and eye color. If you have brown hair and eyes, brown is the attribute
that describes that dimension. Let’s use another simple example to
clarify this: in our car dealership example, if we explore the customer
dimension we might have attributes such as name, address and
phone number listed for each customer. Now that we’ve established the facts,
dimensions, and attributes, it’s time for
the dimensional model to use these things to create two types of tables:
fact tables and dimension tables. A fact table contains measurements or
metrics related to a particular event. This is the primary table
that contains the facts and their relationship with the dimensions. Basically each row in the fact
table represents one event. The entire table could aggregate
several events such as sales in a day. A dimension table is where attributes
of the dimensions of a fact are stored. These tables are joined to the appropriate
fact table using the foreign key. This gives meaning and
context to the facts. That’s how tables are connected
in the dimensional model. Understanding how dimensional modeling
builds connections will help you understand database
design as a BI professional. This will also clarify database schemas
which are the output of design patterns. Coming up, we’re going to check out different kinds
of schemas that result from this type of modeling to understand how these
concepts work in practice.
Video: Dimensional models with star and snowflake schemas
A schema is a way of describing how data is organized in a database. It is the logical definition of the data elements, physical characteristics, and inter-relationships that exist within the model.
There are several common schemas that are used in business intelligence, including star, snowflake, and denormalized schemas.
- A star schema consists of one fact table that references any number of dimension tables. It is shaped like a star, with the fact table at the center and the dimension tables connected to it. Star schemas are designed for high-scale information delivery and make output more efficient because of the limited number of tables and clear direct relationships.
- A snowflake schema is an extension of a star schema with additional dimensions and, often, subdimensions. These dimensions and subdimensions break down the schema into even more specific tables, creating a snowflake pattern. Snowflake schemas can be more complicated than star schemas, but they can be useful for more complex analytical tasks.
- A denormalized schema is a schema that sacrifices some data integrity for performance. It is often used for data warehousing applications where speed is more important than accuracy.
By understanding the different types of schemas, BI professionals can choose the best schema for their specific needs.
Dimensional models with star and snowflake schemas in Business Intelligence
Dimensional models are a type of data model that is commonly used in business intelligence (BI). They are designed to make it easy to analyze data by organizing it into facts and dimensions. Facts are the measures of interest, such as sales revenue or customer satisfaction. Dimensions are the attributes that describe the facts, such as product category, customer location, or date.
Star and snowflake schemas are two common types of dimensional models.
Star schema
A star schema consists of one fact table that is connected to any number of dimension tables. The fact table contains the measures of interest, and the dimension tables contain the attributes that describe the measures.
The star schema is a simple and efficient design that is well-suited for high-volume analytical queries. It is also easy to understand and maintain.
Snowflake schema
A snowflake schema is an extension of the star schema. It adds additional dimensions and, often, subdimensions. Subdimensions are child tables that break down a dimension into more specific categories.
The snowflake schema can be more complex than the star schema, but it can be useful for more complex analytical tasks. For example, a snowflake schema could be used to analyze sales data by product category, customer location, and date.
Which schema to use?
The best schema for a particular BI project will depend on the specific needs of the business. If the project requires high-volume analytical queries, then a star schema is a good choice. If the project requires more complex analytical tasks, then a snowflake schema may be a better choice.
Here is an example of a star schema for a sales data warehouse:
- Fact table: Sales
- Measures: Sales revenue, quantity sold
- Dimensions: Product, customer, date
Here is an example of a snowflake schema for a sales data warehouse:
- Fact table: Sales
- Measures: Sales revenue, quantity sold
- Dimensions: Product
- Subdimensions: Product category, product subcategory
- Dimensions: Customer
- Subdimensions: Customer location, customer segment
- Dimensions: Date
Dimensional models with star and snowflake schemas are powerful tools for BI professionals. They can be used to analyze a wide variety of data and to generate insights that can help businesses make better decisions.
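Assuming hypothetical table and key names for the snowflake example above, a query has to reach each subdimension through its parent dimension. Here is a sketch:

```sql
-- Revenue by product category and customer location in a snowflake schema:
-- subdimension tables are joined through their parent dimension tables.
SELECT
  pc.category_name,
  cl.region,
  SUM(f.sales_revenue) AS revenue
FROM fact_sales AS f
JOIN dim_product           AS p  ON f.product_id  = p.product_id
JOIN dim_product_category  AS pc ON p.category_id = pc.category_id
JOIN dim_customer          AS c  ON f.customer_id = c.customer_id
JOIN dim_customer_location AS cl ON c.location_id = cl.location_id
GROUP BY pc.category_name, cl.region
ORDER BY revenue DESC;
```

The extra joins are the trade-off for the snowflake schema's more normalized dimensions.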
What type of schema consists of one fact table that references any number of dimension tables?
Star
A star schema consists of one fact table that references any number of dimension tables.
In a previous video, we explored how BI
professionals use dimensional models. They make it possible to organize data
using connected facts, dimensions and attributes, to create a design pattern. A schema is the final
output of that pattern. As you’ve learned, a schema is a way of
describing how something, such as data, is organized. In a database, it’s the logical definition of the data
elements, physical characteristics, and inter-relationships that
exist within the model. Think of the schema like a blueprint,
it doesn’t hold data itself, but describes the shape of the data and how it
might relate to other tables or models. Any entry in the database is
an instance of that schema and will contain all of the properties
described in the schema. There are several common schemas that you
may encounter in business intelligence, including star, snowflake and
denormalized, or NoSQL schemas. Star and snowflake schemas are some of
the most common iterations of an actual dimensional model in practice. A star schema is a schema consisting
of one fact table that references any number of dimension tables. As its name suggests,
this schema is shaped like a star. Notice how each of the dimension tables is
connected to the fact table at the center. Star schemas are designed to monitor
data instead of analyzing it. In this way, they enable
analysts to rapidly process data. Therefore they’re ideal for
high scale information delivery, and they make output more efficient because
of the limited number of tables and clear direct relationships. Next we have snowflake schemas, which tend
to be more complicated than star schemas, but the principle is the same. A snowflake schema is an extension of
a star schema with additional dimensions and, often, subdimensions. These dimensions and subdimensions break
down the schema into even more specific tables, creating a snowflake pattern. Like snowflakes in nature,
a snowflake schema and the relationships within
it can be complex. Here’s an example, notice how the fact
table is still at the center, but now there are subdimension tables
connected to the dimension tables, which gives us a more complicated web. Now you have a basic idea of the common
schemas you might encounter in BI. Understanding schemas can help you
recognize the different ways databases are constructed and how BI professionals
influence database functionality. Later on, you’re going to have more
opportunities to explore these different schemas and even construct some yourself.
Reading: Design efficient database systems with schemas
Video: Different data types, different databases
This video discusses several types of databases, including OLTP, OLAP, row-based, columnar, distributed, single-homed, separated storage and compute, and combined databases.
- OLTP (online transaction processing) databases are optimized for data processing instead of analysis. They are used for transactional applications, such as order processing and customer relationship management (CRM).
- OLAP (online analytical processing) databases are optimized for analysis. They are used for data warehousing and business intelligence applications.
- Row-based databases are organized by rows. They are good for transactional applications, but not as good for analytical queries.
- Columnar databases store data in columns. They are good for analytical queries, but not as good for transactional applications.
- Distributed databases store data across multiple physical locations. They are good for large data sets and scalability.
- Single-homed databases store all data in the same physical location. They are less common for large data sets.
- Combined systems store and analyze data in the same place. They are a traditional setup, but can become unwieldy as more data is added.
- Separated storage and computing systems store less relevant data remotely and relevant data locally for analysis. They are more efficient for analytical queries and can scale storage and computations independently.
BI professionals need to understand the different types of databases and how they are used in order to design effective data models and queries.
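To illustrate the row-based versus columnar trade-off, consider the kind of analytical query a columnar database handles well: it only needs to scan the columns the query references instead of reading every full row. The table and column names below are assumptions used for illustration.

```sql
-- Average profit of all sales over a five-year window: a columnar store
-- can scan just the profit and sale_date columns rather than entire rows.
SELECT AVG(profit) AS avg_profit
FROM sales
WHERE sale_date BETWEEN DATE '2018-01-01' AND DATE '2022-12-31';
```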
In this video, the speaker also discusses the importance of database migrations. Database migrations involve moving data from one source platform to another target database. This can be a complex process, but it is often necessary as businesses grow and technology changes.
BI professionals often play a key role in facilitating database migrations. They need to understand the source and target databases, as well as the data that needs to be migrated. They also need to develop a plan for the migration and ensure that the data is properly transformed and loaded into the target database.
Different data types, different databases in Business Intelligence
The type of data that you are storing and analyzing will determine the type of database that is best suited for your needs. For example, if you are storing and analyzing structured data, such as customer names and order details, then you will need a relational database. If you are storing and analyzing unstructured data, such as text and images, then you will need a NoSQL database.
Here is a brief overview of the different types of data types and the types of databases that are best suited for each type of data:
- Structured data: Structured data is data that is organized in a predefined format. It is typically stored in rows and columns in a relational database. Examples of structured data include customer names, order details, and product inventory.
- Unstructured data: Unstructured data is data that does not have a predefined format. It is typically stored in documents, images, and videos. Examples of unstructured data include customer reviews, social media posts, and medical images.
- Relational databases: Relational databases are designed to store and manage structured data. They use tables to store data, and relationships to connect the tables together. Relational databases are the most common type of database used in business intelligence.
- NoSQL databases: NoSQL databases are designed to store and manage unstructured data. They do not use tables and relationships, but instead use a variety of other data structures, such as key-value pairs, documents, and graphs. NoSQL databases are becoming increasingly popular for business intelligence applications that involve large amounts of unstructured data.
Here are some examples of how different data types are used in business intelligence:
- Using a relational database to store customer data: A relational database could be used to store customer name, address, phone number, and order history. This data could then be analyzed to identify trends in customer behavior, such as which products are most popular or which customers are most likely to churn.
- Using a NoSQL database to store social media data: A NoSQL database could be used to store social media posts from customers. This data could then be analyzed to identify customer sentiment, brand awareness, and customer churn.
- Using a relational database to store product data: A relational database could be used to store product name, description, price, and inventory levels. This data could then be analyzed to identify best-selling products, low-stock products, and products that are likely to be discontinued.
When choosing a database for your business intelligence application, it is important to consider the type of data that you will be storing and analyzing. If you are storing and analyzing structured data, then a relational database is the best choice. If you are storing and analyzing unstructured data, then a NoSQL database is the best choice.
It is also important to consider the size and complexity of your data set. If you have a large and complex data set, then you will need a database that can scale to meet your needs. Relational databases can scale well, but NoSQL databases are even better at scaling.
Finally, you need to consider the budget and resources that you have available. Relational databases are typically more affordable and easier to manage than NoSQL databases.
By considering these factors, you can choose the best database for your business intelligence application.
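As a rough sketch of the relational-database examples above, the query below flags customers whose most recent order is old or missing, as candidates for churn analysis. Table and column names are assumptions, and date-arithmetic syntax varies by SQL dialect.

```sql
-- Customers with no orders in roughly the last six months (potential churn).
SELECT
  c.customer_id,
  c.customer_name,
  MAX(o.order_date) AS last_order_date
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name
HAVING MAX(o.order_date) IS NULL
    OR MAX(o.order_date) < CURRENT_DATE - INTERVAL '6' MONTH;
```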
Which database framework features a collection of data systems that exist across many physical locations?
Distributed
A distributed database framework features a collection of data systems that exist across many physical locations.
As we continue our discussion of
database modeling and schemas, it’s important to understand that there
are different facets of databases that a business intelligence
professional might need to consider for their organization. This is because the database framework,
including how platforms are organized and how data is stored and
processed, affects how data is used. Let’s start with an example. Think about
a grocery store’s database systems. They manage daily business processes and
analyze and draw insights from data. For example, in addition to enabling users
to manage sales, a grocer’s database must help decision makers understand
what items customers are buying and which promotions are the most effective. In this video, we’re going to check out a
few examples of database frameworks and learn how they’re
different from one another. In particular, databases vary based
on how the data is processed, organized and stored. For this reason it’s important to know
what type of database your company is using. You will design different data models
depending on how data is stored and accessed on that platform. In addition,
another key responsibility for BI professionals is to
facilitate database migrations, which are often necessary when
technology changes and businesses grow. A database migration involves moving data
from one source platform to another target database. During a migration, users transition
the current database schemas to a new desired state. This could involve adding tables or
columns, splitting fields, removing elements, changing data types or
other improvements. The database migration process often
requires numerous phases and iterations, as well as lots of testing. These are huge projects for BI teams and
you don’t necessarily just want to take the original schema and
use it in the new one. So in this video we’ll discuss several
types of databases including OLTP, OLAP, Row-based, columnar,
distributed, single-homed, separated storage and
compute and combined databases. The first two database technologies
we’re going to explore, OLTP and OLAP systems, are based on
how data is processed. As you’ve learned, an online transaction
processing or OLTP database is one that has been optimized for
data processing instead of analysis. OLTP databases managed
database modification and are operated with traditional
database management system software. These systems are designed to
effectively store transactions and help ensure consistency. An example of an OLTP database
would be an online bookstore. If two people add the same book to their
cart, but there’s only one copy, then the person who completes the checkout
process first will get the book. And the OLTP system ensures that there
aren’t more copies sold than are in stock. OLTP databases are optimized to read,
write and update single rows of data to ensure
that business processes go smoothly. But they aren’t necessarily designed
to read many rows together. Next, as mentioned previously, OLAP
stands for online analytical processing. This is a tool that has been optimized for
analysis in addition to processing and can analyze data from multiple databases. OLAP systems pull data from multiple
sources at one time to analyze data and provide key business insights. Going back to our online bookstore,
an OLAP system could pull data about customer purchases
from multiple data warehouses in order to create personalized home pages
for customers based on their preferences. OLAP database systems enable organizations
to address their analytical needs from a variety of data sources. Depending on the data maturity of the
organization, one of your first tasks as a BI professional could be
to set up an OLAP system. Many companies have OLTP systems
in place to run the business, but they’ll rely on you to create a system
that can prioritize analyzing data. This is a key first step
to drawing insights. Now moving along to row-based and
columnar databases, as the name suggests,
row-based databases are organized by rows.
an entry in the database and details about that instance
are recorded and organized by column. This means that if you wanted the average
profit of all sales over the last five years from the bookstore database, you would have to pull each row from
those years, even if you don’t need all of the information contained in those rows. Columnar databases, on the other hand,
are organized by columns. They’re used in data warehouses
because they are very useful for analytical queries. Columnar databases process data quickly, only retrieving information
from specific columns. In our average profit of all sales
example, with a columnar database, you could choose to specifically pull
the sales column instead of years’ worth of rows. The next databases are focused on storage. Single-homed databases are databases
where all the data is stored in the same physical location. This is less common for organizations
dealing with large data sets, and it will continue to
become rarer as more and more organizations move their data
storage to online and cloud providers. Now, distributed databases are collections
of data systems distributed across multiple physical locations. Think about them like telephone books:
it’s not actually possible to keep all the telephone numbers in the world
in one book, it would be enormous. So instead, the phone numbers
are broken up by location and across multiple books in order
to make them more manageable. Finally, we have more ways of storing and
processing data. Combined systems
are database systems that store and
it enables users to access all of the data that needs to stay
in the system long-term. But it can become unwieldy as more data
is added. Like the name implies, separated storage and computing systems are databases where
less relevant data is stored remotely and the relevant data is
stored locally for analysis. This helps the system run analytical
queries more efficiently because you only interact with the relevant data. It also makes it possible to scale
storage and computations independently. For example, if you have a lot of data but
only a few people are querying it, you don’t need as much computing power,
which can save resources. There are a lot of aspects of
databases that could affect the BI professionals work. Understanding if a system is OLTP or
OLAP, row-based or columnar, distributed or
single-homed, separated storage and computing or combined, or even some
combination of these is essential. Coming up we’ll go even more in
depth about organizing data.
Reading: Database comparison checklist
In this lesson, you have been learning about the different aspects of databases and how they influence the way a business intelligence system functions. The database framework—including how platforms are organized and how data is stored and processed—affects how data is used. Therefore, understanding different technologies helps you make more informed decisions about the BI tools and processes you create. This reading provides a breakdown of databases including OLAP, OLTP, row-based, columnar, distributed, single-homed, separated storage and compute, and combined.
OLAP versus OLTP
Database technology | Description | Use |
---|---|---|
OLAP | Online Analytical Processing (OLAP) systems are databases that have been primarily optimized for analysis. | Provide user access to data from a variety of source systems; used by BI and other data professionals to support decision-making processes; analyze data from multiple databases; draw actionable insights from data delivered to reporting tables |
OLTP | Online Transaction Processing (OLTP) systems are databases that have been optimized for data processing instead of analysis. | Store transaction data; used by customer-facing employees or customer self-service applications; read, write, and update single rows of data; act as source systems that data pipelines can be pulled from for analysis |
Row-based versus columnar
Database technology | Description | Use |
---|---|---|
Row-based | Row-based databases are organized by rows. | Traditional, easy-to-write database organization typically used in OLTP systems; writes data very quickly; stores all of a row’s values together; easily optimized with indexing |
Columnar | Columnar databases are organized by columns instead of rows. | Newer form of database organization, typically used to support OLAP systems; reads data more quickly and only pulls the necessary data for analysis; stores multiple rows’ columns together |
Distributed versus single-homed
Database technology | Description | Use |
---|---|---|
Distributed | Distributed databases are collections of data systems distributed across multiple physical locations. | Easily expanded to address increasing or larger-scale business needs; accessed from different networks; easier to secure than a single-homed database system |
Single-homed | Single-homed databases are databases where all of the data is stored in the same physical location. | Data stored in a single location is easier to access and coordinate across teams; cuts down on data redundancy; cheaper to maintain than larger, more complex systems |
Separated storage and compute versus combined
Database technology | Description | Use |
---|---|---|
Separated storage and compute | Separated storage and computing systems are databases where less relevant data is stored remotely, and relevant data is stored locally for analysis. | Run analytical queries more efficiently because the system only needs to process the most relevant data; scale computation resources and storage systems separately based on your organization’s custom needs |
Combined storage and compute | Combined systems are database systems that store and analyze data in the same place. | Traditional setup that allows users to access all possible data at once; storage and computation resources are linked, so resource management is straightforward |
Practice Quiz: Test your knowledge: Data modeling, schemas, and databases
A business intelligence professional stores large amounts of raw data in its original format within a database system. Then, they can access the data whenever they need it for their BI project. What type of database are they using?
Data lake
They are using a data lake. A data lake is a database system that stores large amounts of raw data in its original format until it’s needed.
AI assistant
The answer is data lake.
Data lakes are a type of database that are designed to store large amounts of raw data in its original format. This makes them ideal for BI professionals who need to store and analyze all of their data, regardless of its format or source.
Data lakes are typically used to store a variety of data types, including structured, semi-structured, and unstructured data. This allows BI professionals to get a complete view of their data and to identify trends and patterns that would not be possible if they were only looking at data from a single source.
Data lakes are also designed to be scalable, so they can grow as the amount of data increases. This is important for BI professionals who need to be able to access and analyze large amounts of data on a regular basis.
Here are some of the benefits of using a data lake for BI:
- Data storage: Data lakes are designed to store large amounts of data in its original format. This makes them ideal for BI professionals who need to store and analyze all of their data, regardless of its format or source.
- Data integration: Data lakes can be used to integrate data from a variety of sources, such as transactional databases, CRM systems, and ERP systems. This allows BI professionals to get a holistic view of the data and to identify trends and patterns that would not be possible if they were only looking at data from a single source.
- Data scalability: Data lakes are designed to be scalable, so they can grow as the amount of data increases. This is important for BI professionals who need to be able to access and analyze large amounts of data on a regular basis.
If you are a BI professional, I recommend using a data lake to store your raw data. This will give you the flexibility and scalability you need to be successful in your projects.
Which of the following statements correctly describe primary and foreign keys? Select all that apply.
- A primary key references a column or a group of columns in which each row uniquely identifies each record in the table.
- A foreign key forms connections because it is a field within a database table that is a primary key in another table.
A primary key references a column or a group of columns in which each row uniquely identifies each record in the table. A foreign key forms connections because it is a field within a database table that is a primary key in another table.
AI assistant
The following statements correctly describe primary and foreign keys:
- A primary key references a column or a group of columns in which each row uniquely identifies each record in the table.
- A foreign key forms connections because it is a field within a database table that is a primary key in another table.
Therefore, the correct answer is both of the above statements.
A primary key is a unique identifier for each record in a table. It cannot be null and cannot contain duplicate values. A foreign key is a column in one table that references the primary key of another table. It is used to establish a relationship between the two tables.
For example, let’s say we have two tables: a `customers` table and an `orders` table. The `customers` table has a primary key called `customer_id`. The `orders` table has a foreign key called `customer_id` that references the `customer_id` primary key in the `customers` table.
This relationship allows us to see which customer placed each order. For example, if we have an order with a `customer_id` of 123, we can look up 123 in the `customers` table to see the name of the customer who placed the order.
Primary and foreign keys are essential for maintaining data integrity in relational databases. They help to ensure that each record in a table is unique and that the data in the two tables is consistent.
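A minimal sketch of that customers/orders relationship in generic ANSI SQL (column lists trimmed for illustration):

```sql
CREATE TABLE customers (
  customer_id   INT PRIMARY KEY,
  customer_name VARCHAR(100)
);

CREATE TABLE orders (
  order_id    INT PRIMARY KEY,
  -- The foreign key ties each order back to exactly one customer.
  customer_id INT NOT NULL REFERENCES customers (customer_id),
  order_date  DATE
);

-- Which customer placed each order:
SELECT o.order_id, c.customer_name
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id;
```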
Fill in the blank: A _ schema is an extension of a star schema, which contains additional dimensions.
snowflake
A snowflake schema is an extension of a star schema, which contains additional dimensions.
AI assistant
The answer is snowflake.
A snowflake schema is an extension of a star schema, which contains additional dimensions. In a snowflake schema, the dimensions are normalized, which means that they are divided into smaller tables. This makes the schema more efficient for querying, but it can also make it more complex to manage.
Snowflake schemas are often used for large datasets, where the star schema would become too cumbersome. They are also used for datasets where the dimensions are highly interconnected.
Here is an example of a snowflake schema:
Fact table:
- order_id (primary key)
- customer_id
- product_id
- quantity
Dimensions:
- customer (customer_id, name, address, etc.)
- product (product_id, name, description, etc.)
- time (order_date, order_time, etc.)
Subdimensions:
- customer_address (customer_id, street, city, state, zipcode)
- product_category (product_id, category_name)
- time_hour (order_date, order_time, hour)
The `customer` dimension is normalized into the `customer_address` subdimension, and the `product` dimension into the `product_category` subdimension. This allows for more efficient querying, as the data for each dimension is stored in a separate table.
Snowflake schemas can be complex to manage, but they can be very efficient for querying large and interconnected datasets.
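Here is a minimal sketch of that normalization in generic ANSI SQL, assuming the table and column names from the example above:

```sql
-- Parent dimension table.
CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(100)
);

-- Subdimension table split out from the customer dimension.
CREATE TABLE customer_address (
  customer_id INT PRIMARY KEY REFERENCES customer (customer_id),
  street      VARCHAR(100),
  city        VARCHAR(50),
  state       VARCHAR(50),
  zipcode     VARCHAR(10)
);
```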
What type of database stores relevant data locally for analysis and less relevant data remotely?
Separated storage and computing systems
Separated storage and computing systems are databases that store relevant data locally for analysis and less relevant data remotely.
AI assistant
The answer is separated storage and computing systems.
Separated storage and computing systems are databases that store relevant data locally for analysis and keep less relevant data in remote storage. This allows for faster and more efficient querying of the relevant data, while also reducing the cost of storing the less relevant data. It also makes it possible to scale storage and computation resources independently.
Separated systems are often used for big data analytics, where datasets can be very large and complex. By keeping the most relevant data local, they can provide quick insights without processing everything held in remote storage.
Here is how a separated storage and computing setup works:
- Local storage: The most relevant data is kept close to the computing resources that analyze it, so queries against it run quickly.
- Remote storage: Less relevant data is stored remotely, typically in a data center or cloud storage. It can still be queried, but access takes longer because the data has to be transferred over the network.
Separated systems can be complex to manage, but they can provide significant performance and cost benefits for analytical workloads.
The other options are not as accurate:
- Combined systems: These store and analyze all of the data in the same place, rather than splitting relevant and less relevant data between local and remote storage.
- Single-homed databases: These store all of the data in the same physical location.
- Distributed databases: These spread data across multiple physical locations, but not based on how relevant the data is for analysis.
Choose the right database
Video: The shape of the data
A data warehouse is a specific type of database that consolidates data from multiple source systems for data consistency, accuracy, and efficient access. It is used to support data-driven decision making.
BI professionals help design data warehouses by considering the following factors:
- Business needs: the questions the organization wants to answer or the problems they want to solve.
- Shape and volume of data: the rows and columns of tables within the warehouse and how they are laid out, as well as the current and future volume of data.
- Model: the tools and constraints of the system, such as the database itself and any analysis tools that will be incorporated into the system.
The data warehouse model for a bookstore would likely include a fact table for sales data and dimension tables for store, customer, product, promotion, time, stock, and currency. This would create a star schema, which is a common data warehouse model that is well-suited for answering specific questions and generating dashboards.
The logic behind data warehouse design is to organize the data in a way that is efficient for data analysis and that meets the specific needs of the business.
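As a rough illustration of how such a star schema gets used, a query like the following could measure promotion effectiveness by joining the fact table to its dimensions. The table and column names are hypothetical:
```sql
-- Total net sales by promotion and store, useful for judging annual promotions
SELECT
  p.promotion_name,
  s.store_name,
  SUM(f.total_net_amount) AS net_sales
FROM sales_fact AS f
JOIN promotion_dim AS p ON f.promotion_id = p.promotion_id
JOIN store_dim     AS s ON f.store_id     = s.store_id
GROUP BY p.promotion_name, s.store_name
ORDER BY net_sales DESC;
```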
The shape of the data in business intelligence refers to the way the data is organized and structured. This includes the types of data, the relationships between the data, and the format of the data.
The shape of the data is important because it affects how easy it is to analyze and interpret the data. For example, if the data is well-organized and structured, it will be easier to use data mining and machine learning techniques to extract insights from the data.
There are a number of different ways to structure data for business intelligence. One common approach is to use a data warehouse. A data warehouse is a central repository for data from multiple source systems. The data in a data warehouse is typically organized in a star schema or snowflake schema.
A star schema is a data warehouse model that consists of a fact table and one or more dimension tables. The fact table contains the quantitative data, such as sales figures or customer churn rates. The dimension tables contain the qualitative data, such as customer demographics or product categories.
A snowflake schema is a more complex data warehouse model that consists of a fact table and multiple layers of dimension tables. The dimension tables are related to each other in a hierarchical fashion.
In addition to data warehouses, there are a number of other ways to structure data for business intelligence. For example, data can be stored in relational databases, NoSQL databases, or data lakes.
The best way to structure data for business intelligence will depend on the specific needs of the organization. However, there are a few general principles that can be followed:
- Organize the data in a logical way. The data should be structured in a way that makes it easy to understand and analyze.
- Use consistent naming conventions. This will make it easier to work with the data and to create reports and dashboards.
- Document the data. The data should be well-documented so that users understand what the data means and how it should be used.
Here are some tips for working with the shape of data in business intelligence:
- Identify the key dimensions of the data. What are the different categories or groups that the data can be divided into?
- Identify the key metrics of the data. What are the quantitative measurements that are most important to the business?
- Understand the relationships between the dimensions and metrics. How are the different dimensions and metrics related to each other?
- Choose the right data structure. The data structure should be appropriate for the type of data and the intended use of the data.
- Clean and prepare the data. The data should be cleaned and prepared before it is analyzed. This includes tasks such as correcting errors, removing outliers, and transforming the data into a consistent format.
By following these tips, you can work effectively with the shape of data in business intelligence to extract insights from the data and improve decision-making.
Fill in the blank: The shape of data refers to the rows and columns of tables within a data warehouse, as well as the _____ of data it contains.
volume
The shape of data refers to the rows and columns of tables within a data warehouse, as well as the volume of data it contains.
You’ve been investigating
data modeling and database schemas as well as how different
types of databases are used in BI. Now we’re going to explore how these
concepts can be used to design data warehouses. But before we get into
data warehouse design, let’s get a refresher on what
a data warehouse actually is. As you probably remember
from earlier in this course, a database is a collection of
data stored in a computer system. Well, a data warehouse is
a specific type of database that consolidates data from
multiple source systems for data consistency, accuracy and
efficient access. Data warehouses are used to support
data-driven decision making. Often these systems are managed by
data warehousing specialists, but BI professionals may help design them. When
it comes to designing a data warehouse, there are a few important things
that a BI professional will consider: business needs, the shape and
volume of the data, and what model the data warehouse will follow. Business needs are the questions
the organization wants to answer or the problems they want
to solve. These needs help determine how it will use, store, and
organize its data. For example, a hospital storing patient
records to monitor health changes has different data requirements than a
financial firm analyzing market trends to determine investment strategies. Next, let's explore the shape and
volume of data from the source system. Typically the shape of data
refers to the rows and columns of tables within the warehouse and
how they are laid out. The volume of data currently and in the
future also changes how the warehouse is designed. The model the warehouse will
follow includes all of the tools and constraints of the system,
such as the database itself and any analysis tools that will be
incorporated into the system. Let’s return to our bookstore example
to develop its data warehouse. We first need to work with stakeholders
to determine their business needs. You’ll have an opportunity to learn
more about gathering information from stakeholders later. But for now let’s say they tell
us that they’re interested in measuring store profitability and website traffic in order to evaluate
the effectiveness of annual promotions. Now we can look at the shape of the data. Consider the business processes or
events that are being captured by tables in the system. Because
this is a retail store, the primary business process is sales. We could have a sales table that includes
information such as quantity ordered, total base amount, total tax amount,
total discounts, and total net amount. These are the facts. As a refresher, a fact is a measurement or
metric used in the business process. These facts could be related to a series
of dimension tables that provide more context. For instance, store,
customer, product, promotion, time, stock, or
currency could all be dimensions. The information in these tables gives more
context to our fact tables which record the business processes and events. Notice how this data model
is starting to shape up. There are several dimension tables all
connected to a fact table at the center, which means we just
created a star schema. With this model,
you can answer the specific question about the effectiveness of annual promotions and also generate a dashboard with
other KPIs and drill-down reports. In this case, we started with
the business's specific needs, looked at the data dimensions we had, and organized them into tables
that formed relationships. Those relationships helped us determine
that a star schema would be the most useful way to organize this data warehouse. Understanding the logic behind
data warehouse design will help you develop effective BI processes and
systems. Coming up, you're going to work more
with database schemas and learn about how data is pulled into
the warehouse from other sources.
Video: Design useful database schemas
A database schema is a way of describing how data is organized. It doesn’t actually contain the data itself, but describes how the data is shaped and the relationships within the database.
A database schema should include the following four elements:
- Relevant data: The schema should include all of the data being described. Otherwise, it won’t be a very useful guide for users trying to understand how the data is laid out.
- Names and data types for each column: The schema should include the column names and the datatype to indicate what data belongs there.
- Consistent formatting: The schema should include consistent formatting across all of the data entries in the database. This means using the same data types for the same columns, and formatting the data in a consistent way.
- Unique keys for each entry: The schema should include unique keys for each entry within the database. This helps to ensure that the data is accurate and consistent.
A database schema is an important part of any BI project. It helps to ensure that the data is organized in a way that is efficient and easy to analyze.
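A single table definition can illustrate all four elements at once. This is a generic SQL sketch with illustrative names and types, not a prescribed design:
```sql
CREATE TABLE sales (
  sale_id     INT PRIMARY KEY,         -- unique key for every entry
  customer_id INT NOT NULL,            -- named column with an explicit data type
  sale_date   DATE NOT NULL,           -- one consistent date format for all entries
  net_amount  DECIMAL(10,2) NOT NULL   -- numeric type, so values can be used in calculations
);
```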
How to design useful database schemas in business intelligence
A database schema is a blueprint for how data is organized in a database. It defines the tables, columns, and relationships between them. A well-designed schema is essential for efficient and effective business intelligence (BI).
Here are some tips for designing useful database schemas in BI:
- Understand the business needs. What questions do you want to be able to answer with the data? What reports do you need to generate? Once you understand the business needs, you can start to identify the entities and relationships that need to be represented in the schema.
- Choose the right data model. There are a number of different data models that can be used for BI, such as star schemas, snowflake schemas, and fact tables. The best data model for your needs will depend on the specific questions you want to answer and the type of data you have.
- Normalize the data. Normalization is the process of organizing data into tables in a way that reduces redundancy and improves data integrity. There are a number of different normalization levels, but it is generally recommended to normalize data to at least third normal form (3NF) for BI.
- Implement naming conventions. Consistent naming conventions will make the schema easier to understand and maintain. For example, you may want to use all caps for table names and lowercase with underscores for column names.
- Document the schema. It is important to document the schema so that users understand how the data is organized and how to use it. The documentation should include information about the tables, columns, relationships, and data types.
Here are some additional tips for designing useful database schemas in BI:
- Use descriptive column names.
- Avoid using reserved words in column names.
- Use surrogate keys for primary keys.
- Use foreign keys to define relationships between tables.
- Create indexes on frequently queried columns.
- Consider using partitioning to improve performance for large datasets (see the sketch below).
- Test the schema thoroughly before deploying it to production.
By following these tips, you can design database schemas that are useful for BI. This will help you to extract valuable insights from your data and make better decisions.
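For example, the partitioning tip above might look like this in BigQuery, assuming a hypothetical sales_warehouse dataset and an order_date timestamp column; other databases use different syntax:
```sql
-- BigQuery-style DDL: partition a large fact table by day and cluster it on a
-- frequently filtered column so queries scan less data.
CREATE OR REPLACE TABLE sales_warehouse.sales_fact
PARTITION BY DATE(order_date)        -- order_date is assumed to be a TIMESTAMP
CLUSTER BY customer_id
AS
SELECT
  order_id,
  customer_id,
  product_id,
  order_date,
  net_amount
FROM sales_warehouse.sales_staging;
```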
Here are some examples of useful database schemas for BI:
- Star schema: A star schema is a common data model for BI. It consists of a fact table and one or more dimension tables. The fact table contains the quantitative data, such as sales figures or customer churn rates. The dimension tables contain the qualitative data, such as customer demographics or product categories.
- Snowflake schema: A snowflake schema is a more complex data model than a star schema. It consists of a fact table and multiple layers of dimension tables. The dimension tables are related to each other in a hierarchical fashion.
- Fact table: A fact table is a table that contains the quantitative data for a particular business process. For example, a sales fact table might contain data about sales orders, such as product ID, customer ID, and order amount.
The best data model for your needs will depend on the specific questions you want to answer and the type of data you have. However, the tips above will help you to design a database schema that is useful for BI.
Earlier, we learned about what considerations go into
designing data warehouses. Based on the business
needs and the shape of the data in
our previous example, we created the dimensional
model with a star schema. That process is sometimes
called Logical data modeling. This involves representing
different tables in the physical data model. Decisions have to be made about how a system will
implement that model. In this video, we’re
going to learn more about what a schema needs to have
for it to be functional. Later, you will use your
database schema to validate incoming data to prevent system errors and ensure
that the data is useful. For all of these reasons, it’s important to
consider the schema early on in any BI project. There are four elements a
database schema should include: the relevant data; names and data types for each
column in each table; consistent formatting
across data entries; and unique keys for every
database entry and object. As we’ve already learned, a database schema is a way of describing how
data is organized. It doesn’t actually
contain the data itself, but describes how the data is shaped and the relationships
within the database. It needs to include all of
the data being described. Or else it won’t be a
very useful guide for users trying to understand
how the data is laid out. Let’s return to our
bookstore database example. We know that our data contains a lot of information
about the promotions, customers, products,
dates, and sales. If our schema doesn’t
represent that, then we’re missing
key information. For instance, it’s often necessary for a BI
professional to add new information to an
existing schema if the current schema can’t answer a specific
business question. If the business wants to know which customer service employee responded the most to requests, we would need to add
that information to the data warehouse and update
the schema accordingly. The schema also needs
to include names and data types for each column in each table within
the database. Imagine if you didn’t organize
your kitchen drawers, it would be really difficult
to find anything if all of your utensils were
just thrown together. Instead, you probably have a specific place where you keep your spoons, forks and knives. Columns are like your
kitchen drawer organizers. They enable you to
know what items go where in order to keep
things functioning. Your schema needs to
include the column names and the datatype to indicate
what data belongs there. In addition to making
sure the schema includes all of
the relevant data, names and data types
for each column, it’s also important to
have consistent formatting across all of the data
entries in the database. Every data entry is an
instance of the schema. For example, imagine we have two transactional systems that we’re combining
into one database. One tracks the promotions
sent to users, and the other tracks
sales to customers. In the source systems, the marketing system that tracks promotions could have
a user ID column, while the sales system
has customer ID instead. To be consistent in
our warehouse schema, we’ll want to use just
one of these columns. In the schema for this database, we might have a column in one of our tables for product prices. If this data is stored as string-type data instead
of numerical data, it can’t be used in calculations such as adding sales
together in a query. Additionally, if any
of the data entries have columns that are
empty or missing values, this might cause issues. Finally, it’s important
that there are unique keys for each entry
within the database. We covered primary and foreign
keys in previous videos. These are what build connections between
tables and enable us to combine relevant data from across the entire database. In summary, in order for a
database schema to be useful, it should contain the relevant
data from the database, the names and data types for
each column in each table, consistent formatting across
all of the entries within the database and unique
keys connecting the tables. These four elements
will ensure that your schema continues
to be useful. Developing your schema
is an ongoing process. As your data or
business needs change, you can continue to adapt the database schema to
address these needs. More to come on that soon.
Reading: Four key elements of database schemas
Reading
Whether you are creating a new database model or exploring a system in place already, it is important to ensure that all elements exist in the schema. The database schema enables you to validate incoming data being delivered to your destination database to prevent errors and ensure the data is immediately useful to users.
Here is a checklist of common elements a database schema should include:
- The relevant data: The schema describes how the data is modeled and shaped within the database and must encompass all of the data being described.
- Names and data types for each column: Include names and data types for each column in each table within the database.
- Consistent formatting: Ensure consistent formatting across all data entries. Every entry is an instance of the schema, so it needs to be consistent.
- Unique keys: The schema must use unique keys for each entry within the database. These keys build connections between the tables and enable users to combine relevant data from across the entire database.
Key takeaways
As you receive more data or business needs change, databases and schemas may also need to change. Database optimization is an iterative process, which means you may need to check the schema multiple times throughout the database’s useful life. Use this checklist to help you ensure that your database schema remains functional.
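When incoming data violates the consistent-formatting element, such as prices arriving as strings, a transformation during loading can bring it back in line with the schema. Here is a hedged BigQuery-style sketch with hypothetical table and column names:
```sql
-- Prices arrived as strings in the staging table; convert them to a numeric
-- type so they match the schema and can be summed. SAFE_CAST returns NULL
-- instead of failing on malformed values, which can then be reviewed.
SELECT
  product_id,
  SAFE_CAST(product_price AS NUMERIC) AS product_price
FROM staging.products
WHERE SAFE_CAST(product_price AS NUMERIC) IS NOT NULL;
```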
Reading: Review a database schema
Reading
So far, you’ve learned about the differences between various types of database schemas, the factors that influence the choice of database schemas, and how to design a database schema for a data warehouse using best practices.
In this reading, you’ll review a database schema created for a fictional scenario and explore the reasoning behind its design. In your role as a BI professional, you’ll need to understand why a database was built in a certain way.
Database schema
Francisco’s Electronics is launching an e-commerce store for its new home office product line. If it’s a success, company decision-makers plan to bring the rest of their products online as well. The company brought on Mia, a senior BI engineer, to help design its data warehouse. The database needed to store order data for analytics and reporting, and the sales manager needed to generate reports quickly to track sales so that the success of the site could be determined.
Below is a diagram of the schema of the sales_warehouse database Mia designed. It contains different symbols and connectors that represent two important pieces of information: the major tables within the system and the relationships among these tables.
The sales_warehouse database schema contains five tables: Sales, Products, Users, Locations, and Orders, which are connected via keys. The tables contain five to eight columns (or attributes) that range in data type. The data types include varchar or char (or character), integer, decimal, date, text (or string), timestamp, bit, and other types depending on the database system chosen.
Review the database schema
To understand a database schema, it’s helpful to understand the purpose of using certain data types and the relationships between fields. The answers to the following questions justify why Mia designed Francisco’s Electronics’ schema this way:
- What kind of database schema is this? Why was this type of database selected?
Mia designed the database with a star schema because Francisco’s Electronics is using this database for reporting and analytics. The benefits of star schema include simpler queries, simplified business reporting logic, query performance gains, and fast aggregations.
- What naming conventions are used for the tables and fields? Are there any benefits of using these naming conventions?
This schema uses a snake case naming convention. In snake case, underscores replace spaces and the first letter of each word is lowercase. Using a naming convention helps maintain consistency and improves database readability. Since snake case for tables and fields is an industry standard, Mia used it in the database.
- What is the purpose of using the decimal fields in data elements?
For fields related to money, there are potential errors when calculating prices, taxes, and fees. You might have values that are technically impossible, such as a value of $0.001, when the smallest value for the United States dollar is one cent, or $0.01. To keep values consistent and avoid accumulated errors, Mia used a decimal(10,2) data type, which stores values with exactly two digits after the decimal point.
Note: Other numeric values, such as exchange rate and quantities, may need extra decimal places to minimize rounding differences in calculations. Also, other data types may be better suited for other fields. To track when an order is created (created_at), you can use a timestamp data type. For other fields with various text sizes, you can use varchar.
- What is the purpose of each foreign and primary key in the database?
Mia designed the Sales table with a primary key ID and included foreign keys in the Sales table that reference the primary keys of the other tables. The foreign keys must be the same data type as their corresponding primary keys. As you’ve learned, a primary key uniquely identifies precisely one record in a table, and a foreign key establishes integrity references from that primary key to records in other tables.
Foreign key in the Sales table | Associated table |
---|---|
order_id | Orders table |
product_id | Products table |
user_id | Users table |
shipping_address_id | Locations table |
billing_address_id | Locations table |
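A simplified sketch of how these keys might be declared in generic SQL (Mia's real table has more columns, and constraint syntax varies by database system):
```sql
CREATE TABLE Sales (
  id                  INT PRIMARY KEY,                -- primary key of the fact table
  order_id            INT REFERENCES Orders (id),
  product_id          INT REFERENCES Products (id),
  user_id             INT REFERENCES Users (id),
  shipping_address_id INT REFERENCES Locations (id),  -- two foreign keys can reference
  billing_address_id  INT REFERENCES Locations (id),  -- the same Locations table
  price               DECIMAL(10,2),                  -- two decimal places for currency values
  created_at          TIMESTAMP                       -- when the order record was created
);
```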
Key takeaways
In this reading, you explored why a database schema was designed in a certain way. In the world of business intelligence, you’ll spend a lot of time modeling business operations with data, exploring data, and designing databases. You can apply your knowledge of this database schema’s design to build your own databases in the future. This will enable you to use and store data more efficiently in your career as a BI professional.
Practice Quiz: Test your knowledge: Choose the right database
When designing a data warehouse, BI professionals take into account which of the following considerations? Select all that apply.
- The business needs
- The model that the data warehouse will follow
- The shape and volume of the data
When designing a data warehouse, BI professionals take into account the business needs, the shape and volume of data, and what model the data warehouse will follow.
Fill in the blank: Logical data modeling involves representing different _ in a physical data model.
Tables
Logical data modeling involves representing different tables in a physical data model.
A BI professional considers the relevant data for a project, the names and data types of table columns, formatting of data entries, and unique keys for database entries and objects. What will these activities enable them to accomplish?
Select appropriate elements for their database schema
These activities will enable the BI professional to select appropriate elements for their database schema.
How data moves
Video: Data pipelines and the ETL process
Data pipelines are a series of processes that transport data from different sources to their final destination for storage and analysis. They automate the flow of data from sources to targets while transforming the data to make it useful as soon as it reaches its destination.
Data pipelines are used to:
- Save time and resources
- Make data more accessible and useful
- Define what, where, and how data is combined
- Automate the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization
- Eliminate errors and combat system latency
Data pipelines can pull data from multiple sources, consolidate it, and then migrate it over to its proper destination. These sources can include relational databases, a website application with transactional data, or an external data source.
Data pipelines are often used in conjunction with ETL systems, which stand for extract, transform, and load. ETL is a type of data pipeline that enables data to be gathered from source systems, converted into a useful format, and brought into a data warehouse or other unified destination system.
Example:
An online streaming service wants to create a data pipeline to understand its viewers' demographics and inform marketing campaigns. The stakeholders are interested in monthly reports.
The data pipeline would be set up to automatically pull in the data from the source systems at monthly intervals. Once the data is ingested, the pipeline would perform some transformations to clean and standardize it. The transformed data would then be loaded into target tables that have already been set up in the database.
Once the data pipeline is built, it can be scheduled to automatically perform tasks on a regular basis. This means that BI team members can focus on drawing business insights from the data rather than having to repeat the process of extracting, transforming, and loading the data over and over again.
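A hedged sketch of what that monthly load into a predefined target table could look like in BigQuery-style SQL; all dataset, table, and column names here are hypothetical:
```sql
-- Scheduled to run monthly: summarize last month's viewing by demographic group
-- and append the results to the target table the pipeline delivers into.
INSERT INTO marketing.viewer_demographics_monthly
  (report_month, age_group, region, hours_watched, viewer_count)
SELECT
  DATE_TRUNC(a.view_date, MONTH)  AS report_month,
  v.age_group,
  v.region,
  SUM(a.hours_watched)            AS hours_watched,
  COUNT(DISTINCT a.viewer_id)     AS viewer_count
FROM streaming.viewing_activity AS a
JOIN streaming.viewers          AS v ON a.viewer_id = v.viewer_id
WHERE DATE_TRUNC(a.view_date, MONTH) =
      DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 1 MONTH)
GROUP BY report_month, v.age_group, v.region;
```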
Conclusion:
Data pipelines are a valuable tool for BI professionals. They can save time and resources, make data more accessible and useful, and help to eliminate errors and combat system latency.
What is a data pipeline?
A data pipeline is a series of processes that transport data from different sources to their final destination for storage and analysis. Data pipelines automate the flow of data from sources to targets while transforming the data to make it useful as soon as it reaches its destination.
What is the ETL process?
The ETL process stands for extract, transform, and load. It is a type of data pipeline that enables data to be gathered from source systems, converted into a useful format, and brought into a data warehouse or other unified destination system.
How do data pipelines and the ETL process work together?
Data pipelines and the ETL process work together to ensure that data is properly extracted, transformed, and loaded into a destination system where it can be used for analysis and reporting.
The following steps outline a typical data pipeline process:
- Extract: The data is extracted from the source systems.
- Transform: The data is transformed into a format that is compatible with the destination system and meets the needs of the business analysts and other users. This may involve cleaning and standardizing the data, aggregating or summarizing the data, and converting the data to different data types.
- Load: The data is loaded into the destination system. This may be a data warehouse, data lake, or data mart.
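As a rough illustration, here is how those three steps might look in BigQuery-flavored SQL. The dataset, table, and column names are hypothetical, and real pipelines often use dedicated tooling rather than plain queries:
```sql
-- Extract: land raw rows from the source export in a staging table.
CREATE OR REPLACE TABLE warehouse.stg_orders AS
SELECT * FROM source_exports.orders_raw;

-- Transform: clean, standardize, and map types to the destination schema.
CREATE OR REPLACE TABLE warehouse.orders_clean AS
SELECT
  CAST(order_id AS INT64)           AS order_id,
  LOWER(TRIM(customer_email))       AS customer_email,
  SAFE_CAST(order_total AS NUMERIC) AS order_total,
  DATE(order_timestamp)             AS order_date
FROM warehouse.stg_orders
WHERE order_id IS NOT NULL;

-- Load: append the transformed rows to the reporting (target) table.
INSERT INTO warehouse.fact_orders (order_id, customer_email, order_total, order_date)
SELECT order_id, customer_email, order_total, order_date
FROM warehouse.orders_clean;
```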
The following are some of the benefits of using data pipelines and the ETL process:
- Improved data quality: Data pipelines and the ETL process can help to improve the quality of data by cleaning and standardizing it. This can lead to more accurate and reliable insights.
- Reduced time and effort: Data pipelines and the ETL process can automate the tasks of extracting, transforming, and loading data. This can free up time and resources for business analysts and other users to focus on more important tasks.
- Increased scalability: Data pipelines and the ETL process can help to scale data operations. This is important for businesses that are growing rapidly or that need to process large volumes of data.
How to build a data pipeline
There are a number of different ways to build a data pipeline. The best approach will vary depending on the specific needs of the business. However, there are some general steps that can be followed:
- Identify the data sources and destination system: The first step is to identify the data sources and the destination system. The data sources may include relational databases, CRM systems, ERP systems, and other applications. The destination system is the system where the data will be stored and analyzed.
- Design the data pipeline: Once the data sources and destination system have been identified, the next step is to design the data pipeline. This involves determining how the data will be extracted from the source systems, transformed, and loaded into the destination system.
- Implement the data pipeline: Once the data pipeline has been designed, it can be implemented using a variety of tools and technologies. There are a number of commercial and open source data pipeline tools available.
- Test and monitor the data pipeline: Once the data pipeline has been implemented, it is important to test and monitor it to ensure that it is working properly. This includes testing the data extraction, transformation, and loading processes, as well as monitoring the performance of the pipeline.
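For the testing and monitoring step, even a simple reconciliation query can catch missing rows. A hedged BigQuery-style sketch, with hypothetical table and column names:
```sql
-- Compare how many rows were staged versus loaded for yesterday's batch;
-- a mismatch is a signal that the pipeline needs investigation.
SELECT
  (SELECT COUNT(*)
   FROM warehouse.stg_orders
   WHERE load_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS staged_rows,
  (SELECT COUNT(*)
   FROM warehouse.fact_orders
   WHERE load_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AS loaded_rows;
```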
Conclusion
Data pipelines and the ETL process are essential tools for business intelligence. They can help to improve the quality, scalability, and efficiency of data operations. This can lead to more accurate and reliable insights, as well as increased agility and competitiveness.
What are some of the key processes performed with data pipelines? Select all that apply.
- Define what, where, and how data is combined
- Help eliminate errors and latency
- Automate the extraction, transformation, combination, validation, and loading of data
Data pipelines are used to define what, where, and how data is combined. They automate the processes involved in extracting, transforming, combining, validating, and loading of data. They also help eliminate errors and latency.
So far, we’ve been
learning a lot about how data is
organized and stored within data warehouses and how schemas describe those systems. Part of your job as
a BI professional is to build and maintain
a data warehouse, taking into consideration
all of these systems that exist and are collecting
and creating data points. To help smooth this process, we use data pipelines. As a refresher, a data pipeline is a series of processes
that transports data from different sources to their final destination
for storage and analysis. This automates the flow of
data from sources to targets while transforming
the data to make it useful as soon as it
reaches its destination. In other words,
data pipelines are used to get data from
point A to point B automatically, save time and resources, and make data
more accessible and useful. Basically, data
pipelines define what,
transforming, combining, validating, and loading data for further
analysis and visualization. Effective data
pipelines also help eliminate errors and
combat system latency. Having to manually move data over and over whenever
someone asks for it or to update a report repeatedly would be
very time-consuming. For example, if a
weather station is getting daily information
about weather conditions, it will be difficult
to manage it manually because of
the sheer volume. They need a system that takes in the data and gets it where it needs to go so it can be
transformed into insights. One of the most
useful things about a data pipeline is that it can pull data from multiple sources, consolidate it, and then migrate it over to its
proper destination. These sources can include
relational databases, a website application with transactional data or an
external data source. Usually, the pipeline has a push mechanism that
enables it to ingest data from multiple
sources in near real time or regular intervals. Once the data has been
pulled into the pipeline, it can be loaded to
its destination. This could be a data warehouse, data lake or data mart, which we’ll learn
more about coming up. Or it can be pulled
directly into a BI or analytics application
for immediate analysis. Often while data is being
moved from point A to point B, the pipeline is also
transforming the data. Transformations include
sorting, validation, and verification, making
the data easier to analyze. This process is
called the ETL system. ETL stands for extract,
transform, and load. This is a type of
data pipeline that enables data to be gathered
from source systems, converted into a useful format, and brought into
a data warehouse or other unified
destination system. ETL is becoming more and more standard for
data pipelines. We’re going to learn
more about it later on. Let’s say a business
analyst has data in one place and needs to
move it to another, that’s where a data
pipeline comes in. But a lot of the time, the structure of the
source system isn’t ideal for analysis which is why a BI professional
wants to transform that data before it gets
to the destination system and why having set
database schemas already designed and ready to receive data is so important. Let’s now explore these steps
in a little more detail. We can think of a data pipeline functioning in three stages, ingesting the raw data, processing and consolidating
it into categories, and dumping the data into reporting tables that
users can access. These reporting tables are
referred to as target tables. Target tables are the
predetermined locations where pipeline data is sent
in order to be acted on. Processing and transforming data while it’s being moved is important because it ensures the data is ready to be
used when it arrives. But let’s explore this
process in action. Say we’re working with an
online streaming service to create a data pipeline. First, we’ll want to consider the end goal
of our pipeline. In this example, our
stakeholders want to understand their viewers demographics to
inform marketing campaigns. This includes information about their viewers ages
and interests, as well as where
they are located. Once we’ve determined what
the stakeholders goal is, we can start thinking
about what data we need the pipeline to ingest. In this case, we’re going to want demographic data
about the customers. Our stakeholders are
interested in monthly reports. We can set up our
pipeline to automatically pull in the data we want
at monthly intervals. Once the data is ingested, we also want our pipeline to perform some transformations, so that it’s clean
and consistent once it gets delivered
to our target tables. Note that these tables
would have already been set up within our
database to receive the data. Now, we have our customer
demographic data and their monthly
streaming habits in one table ready for
us to work with. The great thing
about data pipelines is that once they’re built, they can be scheduled
to automatically perform tasks on
a regular basis. This means BI team members can focus on drawing
business insights from the data rather than having to repeat this process
over and over again. As a BI professional, a big part of your job will involve creating these systems, ensuring that they’re
running correctly, and updating them whenever
business needs change. The valuable benefit that your team will
really appreciate.
Video: Maximize data through the ETL process
ETL is a type of data pipeline that enables data to be gathered from source systems, converted into a useful format, and brought into a data warehouse or other unified destination system. ETL processes work in three stages: extract, transform, and load.
Extract: The pipeline accesses source systems and reads and collects the necessary data.
Transform: The data is validated, cleaned, and prepared for analysis. The datatypes are also mapped from the sources to the target systems.
Load: The data is delivered to its target destination, such as a data warehouse, data lake, or analytics platform.
ETL processes are a common type of data pipeline that BI professionals often build and interact with.
Example:
A business wants to understand its monthly sales data. The sales data is stored in a transactional database, which is not optimized for analytical queries. The business can use an ETL process to extract the sales data from the transactional database, transform it into a format that is optimized for analysis, and load it into a data warehouse. The business can then use the data warehouse to analyze its monthly sales data and gain insights.
Benefits of ETL:
- Improved data quality
- Reduced time and effort
- Increased scalability
- Improved accuracy and reliability of insights
- Increased agility and competitiveness
Tutorial on Maximizing Data through the ETL Process in Business Intelligence
The ETL process is a critical component of any business intelligence (BI) system. It enables organizations to ingest, transform, and load data from a variety of sources into a centralized data warehouse or data lake. This makes the data more accessible and useful for analysis and reporting.
However, simply implementing the ETL process is not enough. Organizations need to carefully plan and execute their ETL processes in order to maximize the value of their data. Here are some tips:
- Identify your business goals. What do you want to achieve with your BI system? Once you know your goals, you can tailor your ETL process to ensure that it is aligned with them. For example, if you are interested in analyzing historical sales data, you will need to make sure that your ETL process extracts and loads all of the relevant sales data from your source systems.
- Understand your data sources. What data sources do you need to ingest? What is the format of the data in each source system? Once you have a good understanding of your data sources, you can design your ETL process accordingly.
- Design your data warehouse or data lake. Where will you store your transformed data? What schema will you use? It is important to design your data warehouse or data lake in a way that is optimized for your BI needs.
- Choose the right ETL tools and technologies. There are a variety of ETL tools and technologies available. Choose the ones that are best suited for your specific needs. Consider factors such as cost, scalability, and ease of use.
- Implement a data governance framework. Data governance is the process of managing and protecting data throughout its lifecycle. This includes establishing policies and procedures for data access, quality, and security. A data governance framework is essential for ensuring that your data is reliable and trustworthy.
Best practices for maximizing data through the ETL process
Here are some best practices for maximizing data through the ETL process:
- Use a data catalog. A data catalog is a repository of information about your data assets. It can help you to identify and understand your data sources, as well as the data that they contain.
- Clean and normalize your data. Data cleaning and normalization are essential for improving the quality of your data. Data cleaning involves removing errors and inconsistencies from the data. Data normalization involves converting the data into a consistent format.
- Transform your data for analysis. Your ETL process should transform your data into a format that is optimized for analysis. This may involve aggregating, summarizing, or joining data from different sources.
- Load your data into a data warehouse or data lake. Once your data has been cleaned, transformed, and loaded into a data warehouse or data lake, it can be used for analysis and reporting.
- Monitor your ETL process. It is important to monitor your ETL process to ensure that it is running smoothly and efficiently. You should also monitor the quality of your data to ensure that it is accurate and reliable.
By following these tips and best practices, organizations can maximize the value of their data through the ETL process. This can lead to improved decision-making, increased efficiency, and competitive advantage.
In which ETL stage would a business intelligence professional map data types from the sources to the target system in order to ensure the data fits the destination?
Transform
In the transform stage, a business intelligence professional maps data types from the sources to the target system in order to ensure the data fits the destination.
We’ve been learning a lot about data pipelines and
how they work. Now, we’re going to discuss a specific kind
of pipeline: ETL. I mentioned previously that ETL enables data to be gathered
from source systems, converted into a useful format, and brought into a
data warehouse or other unified
destination system. Like other pipelines,
ETL processes work in stages and these stages are
extract, transform, and load. Let’s start with extraction. In this stage, the
pipeline accesses a source systems
and then read and collects the necessary
data from within them. Many organizations store their data in
transactional databases, such as OLTP systems, which are great for
logging records or maybe the business
uses flat files, for instance, HTML or log files. Either way, ETL makes the
data useful for analysis by extracting it from
its source and moving it into a
temporary staging table. Next we have transformation. The specific
transformation activities depend on the structure and format of the destination and the requirement
of the business case, but as you’ve learned, these transformations
generally include validating, cleaning, and preparing
the data for analysis. This stage is also when the ETL pipeline maps
the datatypes from the sources to the
target systems so the data fits the
destination conventions. Finally, we have
the loading stage. This is when data is delivered
to its target destination. That could be a data
warehouse, a data lake, or an analytics platform that works with direct data feeds. Note that once the data
has been delivered, it can exist within
multiple locations in multiple formats. For example, there could be a snapshot table that
covers a week of data and a larger archive that has some of
the same records. This helps ensure
the historical data is maintained within the system while
giving stakeholders focused, timely data, and if the business
is interested in understanding and comparing
average monthly sales, the data would be moved to an OLAP system that have been optimized for
analysis queries. ETL processes are
a common type of data pipeline that
BI professionals often build and interact with. Coming up, you’re going to learn more about these systems
and how they’re created.
Video: Choose the right tool for the job
This video is about how BI professionals choose the right tool. Here are the key takeaways:
- Consider the KPIs, how your stakeholders want to view the data, and how the data needs to be moved.
- KPIs are quantifiable values that are closely linked to the business strategy.
- Stakeholders might ask for graphs, static reports, or dashboards.
- Some BI tools include Looker Studio, Microsoft Power BI, and Tableau.
- Some back-end tools include Azure Analysis Service, CloudSQL, Pentaho, SSAS, and SSRS SQL Server.
- Not all BI tools can read data lakes.
- Consider how to transfer the data, how it should be updated, and how the pipeline combines with other tools in the data transformation process.
- You might end up using a combination of tools to create the ideal system.
- BI tools have common features, so the skills you learn can be used no matter which tools you end up working with.
Choose the right tool for the job in Business Intelligence
Business Intelligence (BI) tools are essential for businesses of all sizes to make better decisions. However, with so many different BI tools on the market, it can be difficult to know which one is right for your business.
In this tutorial, we will discuss the key factors to consider when choosing a BI tool and provide recommendations for some of the most popular BI tools on the market.
Key factors to consider when choosing a BI tool
The following are some of the key factors to consider when choosing a BI tool:
- Features: What features are important to your business? Some common BI features include data visualization, reporting, dashboards, and analytics.
- Ease of use: How easy is the tool to use? Consider the skill level of your users when choosing a BI tool.
- Cost: How much does the tool cost? BI tools can range in price from free to thousands of dollars per month.
- Scalability: How scalable is the tool? Can it handle the volume and complexity of your data?
- Integration: Does the tool integrate with your existing systems?
Popular BI tools
Here are some of the most popular BI tools on the market:
- Tableau Tableau is a popular BI tool that is known for its ease of use and powerful data visualization capabilities. Tableau is a good choice for businesses of all sizes, but it can be expensive for larger businesses.
- Microsoft Power BI Microsoft Power BI is a powerful BI tool that integrates closely with Microsoft 365. Power BI is a good choice for businesses that are already using Microsoft 365, as it is easy to integrate with other Microsoft products.
- Looker Studio Looker Studio is a free BI tool that is easy to use and offers a variety of features, including data visualization, reporting, and dashboards. Looker Studio is a good choice for businesses of all sizes, but it may not be as powerful as some other BI tools.
- Qlik Sense Qlik Sense is a powerful BI tool that is known for its ability to handle large and complex data sets. Qlik Sense is a good choice for larger businesses that need a powerful BI tool that can scale with their needs.
- ThoughtSpot ThoughtSpot is a powerful BI tool that is known for its ability to perform complex analytics in real time. ThoughtSpot is a good choice for businesses that need a BI tool that can help them make quick decisions based on real-time data.
Choosing the right BI tool for your business
The best way to choose the right BI tool for your business is to consider the key factors discussed above. Think about the features that are important to you, the skill level of your users, your budget, and your scalability needs.
Once you have considered these factors, you can start to narrow down your choices. You may want to read reviews of different BI tools or try out a few different tools before making a decision.
Conclusion
Choosing the right BI tool is an important decision for any business. By considering the key factors discussed above, you can choose a BI tool that will help you make better decisions and improve your business performance.
In previous videos, we’ve been
exploring pipeline processes that ingest data from different sources, transform
it to match the destination formatting, and push it to a final destination where
users can start drawing business insights. BI professionals play a key role in
building and maintaining these processes, and they use a variety of tools
to help them get the job done. In this video, we’ll learn how BI
professionals choose the right tool. As a BI professional, your organization
will likely have preferred vendors, which means you’ll be given
a set of available BI solutions. One of the great things about BI is
that different tools have very similar principles behind them and
similar utility. This is another example
of a transferable skill. In other words, your general understanding
can be applied to other solutions, no matter which ones your
organization prefers. For instance, the first database management system
I learned was Microsoft Access. This experience helped me gain a basic
understanding of how to build connections between tables, and that made learning
new tools more straightforward. Later in my career,
when I started working with MySQL, I was already able to recognize
the underlying principles. Now it’s possible that you’ll
choose the tools you’ll be using. If that’s the case,
you’ll want to consider the KPIs, how your stakeholders want to view the
data, and how the data needs to be moved. As you’ve learned,
a KPI is a quantifiable value closely linked to the business strategy, which
is used to track progress toward a goal. KPIs let us know whether or
not we’re succeeding, so that we can adjust our processes
to better reach objectives. For example, some financial
KPIs are gross profit margin, net profit margin, and return on assets. Or some HR KPIs are rate of promotion and
employee satisfaction. Understanding your organization’s
KPIs means you can select tools based on those needs. Next, depending on how your
stakeholders want to view the data, there are different tools you can choose. Stakeholders might ask for graphs,
static reports, or dashboards. There are a variety of tools, including
Looker Studio, Microsoft Power BI, and Tableau. Some others are Azure Analysis Service,
CloudSQL, Pentaho, SSAS, and SSRS SQL Server,
which all have reporting tools built in. That’s a lot of options. You’ll get more insights about
these different tools later on. After you’ve thought about how your
stakeholders want to view the data, you’ll want to consider
your back-end tools. This is when you think about
how the data needs to be moved. For example,
not all BI tools can read data lakes. So, if your organization uses
data lakes to store data, then you need to make sure you
choose a tool that can do that. Some other important considerations when
choosing your back-end tools include how to transfer the data,
how it should be updated, and how the pipeline combines with other
tools in the data transformation process. Each of these points helps you
determine must haves for your toolset, which leads to the best options. Also, it’s important to know that you
might end up using a combination of tools to create the ideal system. As you’ve been learning, BI tools have
common features, so the skills you learn in these courses can be used no matter
which tools you end up working with. Going back to my example, I was able to understand the logic behind
transforming and combining tables, whether I was using Microsoft Access or
MySQL. This foundation has transferred across
the different BI tools I’ve encountered throughout my career. Coming up, you’ll learn more about the solutions
that you might work with in the future. You’ll also start getting
hands on with some data soon.
Reading: Business intelligence tools and their applications
Reading
As you advance in your business intelligence career, you will encounter many different tools. One of the great things about the skills you have been learning in these courses is that they’re transferable between different solutions. No matter which tools you end up using, the overall logic and processes will be similar! This reading provides an overview of many of these business intelligence solutions.
Tool | Uses |
---|---|
Azure Analysis Service (AAS) | Connect to a variety of data sources; Build in data security protocols; Grant access and assign roles cross-team; Automate basic processes |
CloudSQL | Connect to existing MySQL, PostgreSQL or SQL Server databases; Automate basic processes; Integrate with existing apps and Google Cloud services, including BigQuery; Observe database processes and make changes |
Looker Studio | Visualize data with customizable charts and tables; Connect to a variety of data sources; Share insights internally with stakeholders and online; Collaborate cross-team to generate reports; Use report templates to speed up your reporting |
Microsoft PowerBI | Connect to multiple data sources and develop detailed models; Create personalized reports; Use AI to get fast answers using conversational language; Collaborate cross-team to generate and share insights on Microsoft applications |
Pentaho | Develop pipelines with a codeless interface; Connect to live data sources for updated reports; Establish connections to an expanded library; Access an integrated data science toolkit |
SSAS SQL Server | Access and analyze data across multiple online databases; Integrate with existing Microsoft services, including BI and data warehousing tools and SSRS SQL Server; Use built-in reporting tools |
Tableau | Connect and visualize data quickly; Analyze data without technical programming languages; Connect to a variety of data sources, including spreadsheets, databases, and cloud sources; Combine multiple views of the data in intuitive dashboards; Build in live connections with updating data sources |
Reading: ETL-specific tools and their applications
Reading
In a previous reading, you were given a list of common business intelligence tools and some of their uses. Many of them have built-in pipeline functionality, but there are a few ETL-specific tools you may encounter. Creating pipeline systems—including ETL pipelines that move and transform data between different data sources to the target database—is a large part of a BI professional’s job, so having an idea of what tools are out there can be really useful. This reading provides an overview.
Tool | Uses |
---|---|
Apache NiFi | Connect a variety of data sources; Access a web-based user interface; Configure and change pipeline systems as needed; Modify data movement through the system at any time |
Google Dataflow | Synchronize or replicate data across a variety of data sources; Identify pipeline issues with smart diagnostic features; Use SQL to develop pipelines from the BigQuery UI; Schedule resources to reduce batch processing costs; Use pipeline templates to kickstart the pipeline creation process and share systems across your organization |
IBM InfoSphere Information Server | Integrate data across multiple systems; Govern and explore available data; Improve business alignment and processes; Analyze and monitor data from multiple data sources |
Microsoft SQL Server Integration Services (SSIS) | Connect data from a variety of sources; Use built-in transformation tools; Access graphical tools to create solutions without coding; Generate custom packages to address specific business needs |
Oracle Data Integrator | Connect data from a variety of sources; Track changes and monitor system performance with built-in features; Access system monitoring and drill-down capabilities; Reduce monitoring costs with access to built-in Oracle services |
Pentaho Data Integrator | Connect data from a variety of sources; Create codeless pipelines with a drag-and-drop interface; Access dataflow templates for easy use; Analyze data with integrated tools |
Talend | Connect data from a variety of sources; Design, implement, and reuse pipelines from a cloud server; Access and search for data using integrated Talend services; Clean and prepare data with built-in tools |
Practice Quiz: Test your knowledge: How data moves
What is the term for the predetermined locations where pipeline data is sent in order to be acted on?
Target tables
Target tables are the predetermined locations where pipeline data is sent in order to be acted on.
A BI professional uses a pipeline to access source systems, then reads and collects the necessary data from within them. Which ETL stage does this scenario describe?
Extract
This describes the extract stage. During extraction, the pipeline accesses source systems, then reads and collects the necessary data from within them.
Many BI tools are built upon similar principles and often have similar utilities. Therefore, a BI professional’s general understanding of one tool can be applied to others. What is this an example of?
A transferable skill
Applying knowledge of one tool to another is an example of a transferable skill.
Data-processing with Dataflow
Video: Introduction to Dataflow
Google Dataflow is a serverless data processing service that can be used to create data pipelines. Dataflow pipelines can be created using Python, SQL, or pre-built templates. Dataflow also includes security features to help keep data safe.
Key points:
- Dataflow pipelines are made up of steps that read data from a source, transform it, and write it to a destination.
- Dataflow can be used to perform a variety of data processing tasks, such as batch processing, stream processing, and machine learning.
- Dataflow pipelines can be created using Python, SQL, or pre-built templates.
- Dataflow includes security features to help keep data safe.
How to use Dataflow:
- Log in to Google Dataflow.
- Go to the jobs page.
- Create a job from template or from SQL.
- Build your pipeline.
- Run your pipeline.
Additional tips:
- Use snapshots to save the current state of your pipeline.
- Use the pipeline section to view a list of your pipelines.
- Use the notebook section to create and share Jupyter Notebooks.
- Use the SQL workspace to write and execute SQL queries.
Introduction to Dataflow in Business Intelligence
Dataflow is a critical component of business intelligence (BI). It allows businesses to collect, transform, and load data from a variety of sources into a single, unified view. This unified view of data can then be used to create dashboards, reports, and other BI insights.
Benefits of using Dataflow in BI
There are several benefits to using Dataflow in BI, including:
- Scalability: Dataflow can scale to handle large volumes of data, making it ideal for enterprise BI applications.
- Flexibility: Dataflow can be used to transform data in a variety of ways, making it suitable for a wide range of BI use cases.
- Reliability: Dataflow is a reliable and secure service, making it ideal for mission-critical BI applications.
How to use Dataflow in BI
To use Dataflow in BI, you will need to:
- Identify your data sources. Dataflow can read data from a variety of sources, including relational databases, cloud storage, and streaming data sources.
- Define your data transformations. Dataflow can be used to perform a variety of data transformations, such as cleaning, filtering, and aggregating data.
- Create a Dataflow pipeline. A Dataflow pipeline is a sequence of steps that read data from a source, transform it, and write it to a destination.
- Run your Dataflow pipeline. Once you have created a Dataflow pipeline, you can run it to transform your data.
- Load your transformed data into your BI system. Once your data has been transformed, you can load it into your BI system to create dashboards, reports, and other BI insights.
Here is an example of how Dataflow can be used in BI:
A retail company wants to use Dataflow to create a BI dashboard that shows sales data by product category and region. The company has sales data stored in a relational database. The company also wants to include data from its e-commerce platform, which is stored in cloud storage.
To create the BI dashboard, the company would first need to create a Dataflow pipeline. The pipeline would read data from the relational database and the cloud storage. The pipeline would then transform the data to match the format required by the BI dashboard. Finally, the pipeline would write the transformed data to a destination, such as a data warehouse or BigQuery.
Once the data has been loaded into the destination, the company can use a BI tool to create a dashboard that shows sales data by product category and region.
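To make the read–transform–write shape of such a pipeline concrete, here is a minimal sketch using the Apache Beam Python SDK, the open source library that Dataflow pipelines are built with. The file names, CSV layout, and aggregation are hypothetical, and for brevity it reads and writes text files rather than a relational database and BigQuery; adding the Dataflow runner options (project, region, and a Cloud Storage staging bucket) would let the same pipeline run on Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical input and output locations; on Dataflow these would typically be
# Cloud Storage paths such as gs://my-bucket/sales.csv
INPUT_FILE = "sales.csv"
OUTPUT_PREFIX = "sales_by_category"

def parse_row(line):
    """Turn one CSV line such as 'Drinks,12.50' into a (category, amount) pair."""
    category, amount = line.split(",")
    return category, float(amount)

options = PipelineOptions()  # add --runner=DataflowRunner plus project options to run on Dataflow

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read source" >> beam.io.ReadFromText(INPUT_FILE)          # extract
        | "Parse rows" >> beam.Map(parse_row)                         # transform
        | "Total per category" >> beam.CombinePerKey(sum)             # transform
        | "Format output" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write destination" >> beam.io.WriteToText(OUTPUT_PREFIX)   # load
    )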
Conclusion
Dataflow is a powerful tool that can be used to create a scalable, flexible, and reliable BI data pipeline. By using Dataflow, businesses can collect, transform, and load data from a variety of sources into a single, unified view. This unified view of data can then be used to create dashboards, reports, and other BI insights.
Recently, you were introduced
to data pipelines. You learned that many of the procedures and
understandings involved in one pipeline tool can be transferred
to other solutions. So in this course we’re going
to be using Google Dataflow. But even if you end up working with
a different pipeline tool, the skills and steps involved here will be very useful. And using Google Dataflow now will
be a great opportunity to practice everything you’ve learned so far. We’ll start by introducing you to Dataflow
and going over its basic utilities. Later on you’ll use this tool to
complete some basic BI tasks and set up your own pipeline. Google Dataflow is a serverless
data-processing service that reads data from the source, transforms it, and
writes it in the destination location. Dataflow creates pipelines with open
source libraries which you can interact with using different languages
including Python and SQL. Dataflow includes a selection of pre-built
templates that you can customize or you can use SQL statements
to build your own pipelines. The tool also includes security
features to help keep your data safe. Okay, let’s open Dataflow and
explore it together now. First, we’ll log in and go to the console. Once the console is open,
let’s find the jobs page. If this is your first time using Dataflow,
it will say no jobs to display. The jobs page is where we’ll find
current jobs in our project space. There are options to create jobs from
template or create jobs from SQL. Snapshots save the current state
of a streaming pipeline so that you can start a new version
without losing the current one. This is great for testing your pipelines,
updating them seamlessly for users and backing up and recovering old versions. The pipeline section contains a list
of the pipelines you’ve created. Again, if this is your first time using
Dataflow, it will display the processes you need to enable before you
can start building pipelines. Now is a great time to do that. Just click fix all to enable the API
features and set your location. The Notebook section
enables you to create and save shareable Jupyter Notebooks
with live code. This is useful for first time ETL
tool users to check out examples and visualize the transformations. Finally, we have the SQL workspace. If you’ve worked with BigQuery before, such as in the Google Data Analytics
Certificate, this will be familiar. This is where you write and execute SQL
queries while working within Dataflow and there you go. Now you can log into Google Dataflow and
start exploring it on your own. We’ll have many more opportunities
to work with this tool soon.
Practice Quiz: [Optional] Activity: Create a Google Cloud account
Reading: Guide to Dataflow
Practice Quiz: [Optional] Activity: Create a streaming pipeline in Dataflow
Video: Coding with Python
Python is a popular programming language that is well-suited for business intelligence (BI). It is a general-purpose language that can be used to connect to databases, develop pipelines, and process big data.
Python is primarily object-oriented and interpreted. This means that it is modeled around data objects and that it is executed by an interpreter rather than being compiled.
One of the most valuable things about Python for BI is its ability to create and save data objects. This allows BI professionals to interact with data in a flexible and efficient way.
Python can also be used to create notebooks, which are interactive programming environments for creating data reports. This can be a great way to build dynamic reports for stakeholders.
Google Dataflow is a serverless data processing service that can be used to create data pipelines. Python can be used to write Dataflow pipelines, which allows BI professionals to take advantage of the scalability, flexibility, and reliability of Dataflow.
Key takeaways:
- Python is a popular programming language that is well-suited for BI.
- Python is primarily object-oriented and interpreted.
- Python can be used to create and save data objects, which is valuable for BI.
- Python can be used to create notebooks, which are interactive programming environments for creating data reports.
- Python can be used to write Dataflow pipelines, which allows BI professionals to take advantage of the scalability, flexibility, and reliability of Dataflow.
Coding with Python in Business Intelligence
Python is a powerful programming language that can be used for a variety of tasks in business intelligence (BI). It can be used to connect to databases, manipulate data, and create visualizations. Python is also a popular language for developing machine learning models, which can be used to automate BI tasks and generate insights.
Here are some of the ways that Python can be used in BI:
- Connecting to databases: Python can be used to connect to a variety of databases, including relational databases, cloud databases, and big data databases. This allows BI professionals to access and analyze data from a variety of sources.
- Manipulating data: Python has a number of libraries that can be used to manipulate data, such as NumPy and Pandas. These libraries allow BI professionals to clean, filter, and aggregate data in a variety of ways.
- Creating visualizations: Python can be used to create a variety of visualizations, such as charts, graphs, and maps. This allows BI professionals to communicate insights to stakeholders in a visually appealing and informative way.
- Developing machine learning models: Python is a popular language for developing machine learning models. These models can be used to automate BI tasks, such as forecasting and anomaly detection. Python also has a number of libraries that make it easy to deploy machine learning models to production.
Here is a simple example of how to use Python to connect to a database and query data:
import pymysql
# Connect to the database
conn = pymysql.connect(host='localhost', user='root', password='', db='mydb')
# Create a cursor
cur = conn.cursor()
# Execute a query
cur.execute('SELECT * FROM customers')
# Fetch the results
rows = cur.fetchall()
# Close the cursor and connection
cur.close()
conn.close()
# Print the results
for row in rows:
    print(row)
This code will connect to a MySQL database called mydb
and execute a query to select all rows from the customers
table. The results of the query will then be printed to the console.
Here is a more complex example of how to use Python to manipulate data and create a visualization:
import pandas as pd
import matplotlib.pyplot as plt
# Read the data from a CSV file
df = pd.read_csv('sales_data.csv')
# Calculate the total sales for each product category
product_sales = df.groupby('product_category')['sales'].sum()
# Create a bar chart of the product sales
plt.bar(product_sales.index, product_sales.values)
# Set the chart title and labels
plt.title('Product Sales')
plt.xlabel('Product Category')
plt.ylabel('Sales')
# Show the chart
plt.show()
This code will read the sales data from a CSV file and calculate the total sales for each product category. The results will then be used to create a bar chart of the product sales.
These are just a few examples of how Python can be used in BI. Python is a powerful tool that can be used to automate tasks, generate insights, and communicate findings to stakeholders.
Here are some tips for using Python in BI:
- Use libraries: There are a number of Python libraries that can be used for BI tasks. These libraries can save you time and effort by providing pre-written code for common tasks.
- Start small: Don’t try to do too much too soon. Start with simple tasks and gradually work your way up to more complex tasks.
- Get help: There is a large community of Python users who are willing to help. If you get stuck, don’t be afraid to ask for help online or in forums.
Learning to use Python in BI can be a rewarding experience. By learning Python, you can automate your work, generate insights, and communicate your findings more effectively.
If you’re coming into
these courses from the Google Data
Analytics Certificate, or if you’ve been working
with relational databases, you’re probably familiar with
the query language, SQL. Query languages are specific computer
programming languages used to communicate
with a database. As a BI professional, you may be expected to use other kinds of programming
languages too. That’s why in this video, we’ll explore one of the most popular
programming languages out there, Python. A programming language
is a system of words and symbols used to write instructions
that computers follow. There are lots of different
programming languages, but Python was specifically
developed to enable users to write commands in fewer lines
than most other languages. Python is also open source, which means it’s freely
available and may be modified and shared by
the people who use it. There’s a large community
of Python users who develop tools and libraries
to make Python better, which means there are
a lot of resources available for BI
professionals to tap into. Python is a general purpose
programming language that can be applied to
a variety of contexts. In business intelligence,
it’s used to connect to a database system to
read and modify files. It can also be combined with other software tools to develop pipelines and it can even process big data and
perform calculations. There are a few key things
you should understand about Python as you begin
your programming journey. First, it is primarily
object-oriented and interpreted. Let’s first understand what it means to be object-oriented. Object-oriented
programming languages are modeled around data objects. These objects are chunks of code that capture
certain information. Basically, everything in
the system is an object, and once data has been
captured within the code, it’s labeled and defined
by the system so that it can be used again later without having to
re-enter the data. Because Python has been adopted pretty broadly by
the data community, a lot of libraries have been developed to pre-define
data structures and common operations that you can apply to the
objects in your system. This is extremely useful
when you need to repeat analysis or even use the same transformations
for multiple projects. Not having to re-enter the
code from scratch saves time. Note that object-oriented
programming languages differ from functional
programming languages, which are modeled
around functions. While Python is primarily
object-oriented, it can also be used as a functional
programming language to create and apply functions. Part of the reason Python is so popular is that it’s flexible. But for BI, the really valuable thing
about Python is its ability to create and save data objects that can then be
interacted with via code. Now, let’s consider
the fact that Python is an
interpreted language. Interpreted languages are
programming languages that use an interpreter, typically another program, to read and execute
coded instructions. This is different from a
compiled programming language, which compiles
coded instructions that are executed directly
by the target machine. One of the biggest
differences between these two types of
programming languages is that the compiled code executed by the machine is almost
impossible for humans to read. Because Python is an
interpreted language, it’s very useful for BI
professionals: it enables them to use the language
in an interactive way. For example, Python can be
used to make notebooks. A notebook is an interactive, editable programming environment for creating data reports. This can be a great way to build dynamic reports
for stakeholders. Python is a great tool to
have in your BI toolbox. There’s even an option to use Python commands in
Google Dataflow. Pretty soon, you’ll get to check it out for
yourself when you start writing Python in
your Dataflow workspace.
Reading: Python applications and resources
In this course, you will primarily be using BigQuery and SQL when interacting with databases in Google Dataflow. However, Dataflow does have the option for you to work with Python, which is a widely used general-purpose programming language. Python can be a great tool for business intelligence professionals, so this reading provides resources and information for adding Python to your toolbox!
Elements of Python
There are a few key elements about Python that are important to understand:
- Python is open source and freely available to the public.
- It is an interpreted programming language, which means it uses another program to read and execute coded instructions.
- Data is stored in data frames, similar to R.
- In BI, Python can be used to connect to a database system to work with files.
- It is primarily object-oriented.
- Formulas, functions, and multiple libraries are readily available.
- A community of developers exists for online code support.
- Python uses simple syntax for straightforward coding.
- It integrates with cloud platforms including Google Cloud, Amazon Web Services, and Azure, as illustrated in the sketch below.
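As a small illustration of Python connecting to a database system and integrating with Google Cloud, the sketch below queries the public San Francisco street trees dataset (used later in this module) directly from Python. It is a minimal sketch that assumes the google-cloud-bigquery client library is installed and that Google Cloud authentication is already configured.
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Assumes authentication is set up, for example via
# `gcloud auth application-default login`, and a default project is configured.
client = bigquery.Client()

query = """
    SELECT address, COUNT(address) AS number_of_trees
    FROM `bigquery-public-data.san_francisco_trees.street_trees`
    WHERE address != "null"
    GROUP BY address
    ORDER BY number_of_trees DESC
    LIMIT 5
"""

# Run the query and print each resulting row.
for row in client.query(query).result():
    print(row.address, row.number_of_trees)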
Resources
If you’re interested in learning Python, there are many resources available to help. Here are just a few:
- The Python Software Foundation (PSF): a website with guides to help you get started as a beginner
- Python Tutorial: a Python 3 tutorial from the PSF site
- Coding Club Python Tutorials: a collection of coding tutorials for Python
General tips for learning programming languages
As you have been discovering, there are often transferable skills you can apply to a lot of different tools—and that includes programming languages! Here are a few tips:
- Define a practice project and use the language to help you complete it. This makes the learning process more practical and engaging.
- Keep in mind previous concepts and coding principles. After you have learned one language, learning another tends to be much easier.
- Take good notes or make cheat sheets in whatever format (handwritten or typed) that works best for you.
- Create an online filing system for information that you can easily access while you work in various programming environments.
Organize data in BigQuery
Video: Gather information from stakeholders
Before building BI processes for stakeholders, BI professionals need to gather information about the current processes in place, the stakeholders’ goals, metrics, and final target tables. They also need to identify the stakeholders’ assumptions and biases about the project. BI professionals can do this by creating a presentation and leading a workshop session with the different teams, observing the stakeholders at work, and asking them questions.
Tutorial on how to gather information from stakeholders in Business Intelligence
Gathering information from stakeholders is an essential step in any Business Intelligence (BI) project. By understanding the needs of your stakeholders, you can ensure that your BI solutions are aligned with their goals and that they will have the data they need to make informed decisions.
Here are some tips on how to gather information from stakeholders in BI:
1. Identify your stakeholders.
The first step is to identify all of the stakeholders who will be impacted by your BI project. This could include executives, managers, analysts, and other employees who rely on data to make decisions.
2. Understand their needs and goals.
Once you have identified your stakeholders, you need to understand their individual needs and goals. What kind of data do they need? What questions are they trying to answer? What are their biggest pain points?
3. Use a variety of data gathering methods.
There are a variety of different ways to gather information from stakeholders. You can conduct interviews, surveys, workshops, or simply have informal conversations.
4. Be specific.
When gathering information from stakeholders, be as specific as possible. Don’t just ask them what data they need. Ask them specific questions about the types of reports they would like to see, the metrics they would like to track, and the insights they would like to gain from the data.
5. Be clear and concise.
When communicating with stakeholders, be clear and concise. Avoid using jargon or technical terms that they may not understand.
6. Be responsive.
Be responsive to stakeholder feedback. If they have any questions or concerns, be sure to address them promptly.
Here are some specific examples of questions you can ask stakeholders when gathering information:
- What are your biggest pain points when it comes to data?
- What kind of data do you need to make better decisions?
- What questions are you trying to answer with data?
- What metrics are most important to you?
- What reports would you like to see?
- How would you like to be able to access and interact with data?
- What are your goals for this BI project?
By gathering information from stakeholders and understanding their needs, you can ensure that your BI solutions are successful.
Here are some additional tips for gathering information from stakeholders in BI:
- Use a data dictionary. A data dictionary is a document that describes the data that is used in a BI system. It can be a helpful tool for communicating with stakeholders and ensuring that everyone is on the same page.
- Create a data governance plan. A data governance plan defines the policies and procedures for managing data in a BI system. It can help to ensure that the data is accurate, reliable, and secure.
- Provide training to stakeholders. Once you have implemented your BI solutions, it is important to provide training to stakeholders on how to use them. This will help them to get the most out of the data and to make better decisions.
By following these tips, you can effectively gather information from stakeholders in BI and ensure that your solutions meet their needs.
You’ve already
learned quite a bit about the different
stakeholders that a BI professional might work with in an organization and how
to communicate with them. You’ve also learned that gathering information
from stakeholders at the beginning of a project is an essential step
of the process. Now that you understand
more about pipelines, let’s consider what information
you need to gather from stakeholders before building
BI processes for them, that way you’ll know
exactly what they need and can help make their work
as efficient as possible. Part of your job as
a BI professional is understanding the
current processes in place and how you can integrate BI tools into those
existing workstreams. Oftentimes in BI, you aren’t just trying to answer individual questions every day, you’re trying to find
out what questions your team is asking so that you can build them
a tool that enables them to get that
information themselves. It’s rare for people
to know exactly what they need and
communicate that to you. Instead, they will
usually come to you with a list of
problems or symptoms, and it’s your responsibility to figure out how to help them. Stakeholders who are
less familiar with data simply don’t know what BI
processes are possible. This is why cross business
alignment is so important. You want to create a
user-centered design where all of the requirements for
the entire team are met, that way your solutions address
everyone’s needs at once, streamlining their
processes as a group. It can be challenging
to figure out what all of your different
stakeholders require. One option is to
create a presentation and lead a workshop session
with the different teams. This can be a great
way to support cross business alignment and
determine everyone’s needs. It’s also very helpful to
spend some time observing your stakeholders
at work and asking them questions about what
they’re doing and why. In addition, it’s important to establish the metrics
and what data the target table should contain early on with cross
team stakeholders. This should be done before
you start building the tools. As you’ve learned, a metric is a single quantifiable data point that is used to
evaluate performance. In BI, the metrics businesses are usually interested in are KPIs that help them assess how successful they are at
achieving certain goals. Understanding those goals
and how they can be measured is an
important first step in building a BI tool. You also know that
target tables are the final destination
where data is acted on. Understanding the end goals helps you design
the best process. It’s important to
remember that building BI processes is a collaborative
and iterative process. You will continue gathering information from your
stakeholders and using what you’ve
learned until you create a system that
works for your team, and even then you might
change it as new needs arise. Often, your stakeholders will have identified
their questions, but they may not have identified their assumptions or biases
about the project yet. This is where a BI professional
can offer insights. Collaborating closely
with stakeholders ensures that you are
keeping their needs in mind as you design the BI tools that will
streamline their processes. Understanding their
goals, metrics, and final target tables, and communicating
across multiple teams will ensure that you make
systems that work for everyone.
Reading: Merge data from multiple sources with BigQuery
Previously, you started exploring Google Dataflow, a Google Cloud Platform (GCP) tool that reads data from the source, transforms it, and writes it in the destination location. In this lesson, you will begin working with another GCP data-processing tool: BigQuery. As you may recall from the Google Data Analytics Certificate, BigQuery is a data warehouse used to query and filter large datasets, aggregate results, and perform complex operations.
As a business intelligence (BI) professional, you will need to gather and organize data from stakeholders across multiple teams. BigQuery allows you to merge data from multiple sources into a target table. The target table can then be turned into a dashboard, which makes the data easier for stakeholders to understand and analyze. In this reading, you will review a scenario in which a BI professional uses BigQuery to merge data from multiple stakeholders in order to answer important business questions.
The problem
Consider a scenario in which a BI professional, Aviva, is working for a fictitious coffee shop chain. Each year, the cafes offer a variety of seasonal menu items. Company leaders are interested in identifying the most popular and profitable items on their seasonal menus so that they can make more confident decisions about pricing; strategic promotion; and retaining, expanding, or discontinuing menu items.
The solution
Data extraction
In order to obtain the information the stakeholders are interested in, Aviva begins extracting the data. The data extraction process includes locating and identifying relevant data, then preparing it to be transformed and loaded. To identify the necessary data, Aviva implements the following strategies:
Meet with key stakeholders
Aviva leads a workshop with stakeholders to identify their objectives. During this workshop, she asks stakeholders questions to learn about their needs:
- What information needs to be obtained from the data (for instance, performance of different menu items at different restaurant locations)?
- What specific metrics should be measured (sales metrics, marketing metrics, product performance metrics)?
- What sources of data should be used (sales numbers, customer feedback, point of sales)?
- Who needs access to this data (management, market analysts)?
- How will key stakeholders use this data (for example, to determine which items to include on upcoming menus, make pricing decisions)?
Observe teams in action
Aviva also spends time observing the stakeholders at work and asking them questions about what they’re doing and why. This helps her connect the goals of the project with the organization’s larger initiatives. During these observations, she asks questions about why certain information and activities are important for the organization.
Organize data in BigQuery
Once Aviva has completed the data extraction process, she transforms the data she’s gathered from different stakeholders and loads it into BigQuery. Then she uses BigQuery to design a target table to organize the data. The target table helps Aviva unify the data. She then uses the target table to develop a final dashboard for stakeholders to review.
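As a minimal sketch of what that unification step might look like in code, the example below creates a target table in BigQuery from Python by combining two source tables. All project, dataset, table, and column names here are hypothetical and used only for illustration; it assumes the google-cloud-bigquery client library and Google Cloud authentication are already set up.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes authentication and a default project are configured

# Hypothetical names: two source tables of seasonal sales are combined into one target table.
merge_query = """
    CREATE OR REPLACE TABLE `my-project.coffee_bi.seasonal_sales_target` AS
    SELECT item_name, location, sale_date, sale_amount
    FROM `my-project.coffee_raw.cafe_pos_sales`
    UNION ALL
    SELECT item_name, location, sale_date, sale_amount
    FROM `my-project.coffee_raw.ecommerce_sales`
"""

# Run the query and wait for the target table to be created.
client.query(merge_query).result()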
The results
When stakeholders review the dashboard, they are able to identify several key findings about the popularity and profitability of items on their seasonal menus. For example, the data indicates that many peppermint-based products on their menus have decreased in popularity over the past few years, while cinnamon-based products have increased in popularity. This finding leads stakeholders to decide to retire three of their peppermint-based drinks and bakery items. They also decide to add a selection of new cinnamon-based offerings and launch a campaign to promote these items.
Key findings
Organizing data from multiple sources in a tool like BigQuery allows BI professionals to find answers to business questions. Consolidating the data in a target table also makes it easier to develop a dashboard for stakeholders to review. When stakeholders can access and understand the data, they can make more informed decisions about how to improve services or products and take advantage of new opportunities.
Practice Quiz: Activity: Set up a sandbox and query a public dataset in BigQuery
Reading: Unify data with target tables
As you have been learning, target tables are predetermined locations where pipeline data is sent in order to be acted on in a database system. Essentially, a source table is where data comes from, and a target table is where it’s going. This reading provides more information about the data-extraction process and how target tables fit into the greater logic of business intelligence processes.
Data extraction
Data extraction is the process of taking data from a source system, such as a database or a SaaS, so that it can be delivered to a destination system for analysis. You might recognize this as the first step in an ETL (extract, transform, and load) pipeline. There are three primary ways that pipelines can extract data from a source in order to deliver it to a target table:
- Update notification: The source system issues a notification when a record has been updated, which triggers the extraction.
- Incremental extraction: The BI system checks for any data that has changed at the source and ingests these updates.
- Full extraction: The BI system extracts a whole table into the target database system.
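For instance, here is a minimal sketch of the incremental approach, assuming a hypothetical source table with a last_modified timestamp column and a stored record of when the pipeline last ran. It uses the google-cloud-bigquery client library; the project, dataset, and table names are hypothetical.
from datetime import datetime, timezone

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hypothetical: in practice this timestamp would come from the pipeline's own metadata store.
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Only ingest rows that changed since the last run (incremental extraction),
# instead of re-reading the whole table (full extraction).
query = """
    SELECT *
    FROM `my-project.sales.orders`
    WHERE last_modified > @last_run
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("last_run", "TIMESTAMP", last_run)
    ]
)

changed_rows = list(client.query(query, job_config=job_config).result())
print(f"{len(changed_rows)} changed rows to deliver to the target table")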
Once data is extracted, it must be loaded into target tables for use. In order to drive intelligent business decisions, users need access to data that is current, clean, and usable. This is why it is important for BI professionals to design target tables that can hold all of the information required to answer business questions.
The importance of target tables
As a BI professional, you will want to take advantage of target tables as a way to unify your data and make it accessible to users. In order to draw insights from a variety of different sources, having a place that contains all of the data from those sources is essential.
Practice Quiz: Activity: Create a target table in BigQuery
Reading: Activity Exemplar: Create a target table in BigQuery
In this activity, you used BigQuery to create a target table to store data you pulled from a dataset of street tree information from San Francisco, California. In your BI role, you’ll need to use programs such as BigQuery and Dataflow to move and analyze data with SQL. Now, you’ve practiced a key part of the Extraction stage of the BI pipeline: pulling data from a source and placing it into its own table.
The exemplar you’re about to review will help you evaluate whether you completed the activity correctly. Because this activity involves copying, pasting, and executing a complete SQL query, you will just need to check that your result matches this exemplar.
If you find that the result you received is different from the exemplar provided, double check the formatting of the query you copied. Review the explanation of the SQL query in this activity to learn more about how the SQL query works and how to write your own in your projects.
Access the exemplar
To explore the query result exemplar, download the following attachment:
In this activity, you ran the following SQL query to create a target table:
SELECT
address,
COUNT(address) AS number_of_trees
FROM
`bigquery-public-data.san_francisco_trees.street_trees`
WHERE
address != "null"
GROUP BY address
ORDER BY number_of_trees DESC
LIMIT 10;
- The SELECT clause selects the address of each tree. By using the COUNT function, you count the number of trees at each address and return a single row of data per address, instead of per tree. This data is saved as a new column.
- The FROM clause is straightforward as it specifies the street_trees table within the San Francisco Street Trees dataset.
- The WHERE clause is necessary to ensure that your target table only includes rows that have a value in the address column.
- The GROUP BY clause specifies that you’re grouping data by the address, and the ORDER BY clause sorts the data in descending order by the number_of_trees column.
- The LIMIT clause limits the query to return only the top ten rows of data. When working with large datasets, including a limit will decrease the processing time required to return the data.
If you need a refresher on SQL code, review some resources from the Google Data Analytics Certificate: Review Google Data Analytics Certificate content about SQL and Review Google Data Analytics Certificate content about SQL best practices.
The result of this query is a target table with two columns: the address column and the number_of_trees column, which holds the total number of trees planted at each address as calculated in the SELECT clause. If properly executed, the first value in the address column is 100x Cargo Way, and the number_of_trees next to it is 135. If you didn’t receive this result, please review the code and run it again.
Furthermore, the target table shows the 10 addresses with the most trees planted by the Department of Public Works in the city of San Francisco, along with the number of trees at each address:
- 100x Cargo Way: 135
- 700 Junipero Serra Blvd: 125
- 1000 San Jose Ave: 113
- 1200 Sunset Blvd: 110
- 1600 Sunset Blvd: 102
- 2301 Sunset Blvd: 94
- 1501 Sunset Blvd: 93
- 2401 Sunset Blvd: 92
- 100 STAIRWAY5: 87
- 2601 Sunset Blvd: 84
Key takeaways
Target tables are the destination for data during the Extraction stage of a pipeline. You’ll use them in your role as a BI professional to store data after pulling it from its sources. Once the data is in a target table, you can transform it with BigQuery or Dataflow and load it into reporting tables. You’ll learn about the Transform and Load stages of data pipelines later in this course.
Reading: Case study: Wayfair – Working with stakeholders to create a pipeline
Review: Data models and pipelines
Video: Wrap-up
This section of the course has covered the following topics:
- A BI professional’s role in the organization and storage of data
- Data models and schemas
- Design patterns based on the organization’s needs
- Database design
- Data pipelines, ETL processes, and building BI tools to automate data movement
- Strategies for gathering information from stakeholders to ensure that the tools solve business problems
The next section of the course will cover how to maintain BI tools and optimize database systems.
Hey, great work so far. You’re almost done with
the first section of this course. You’ve
learned a lot. So far, we’ve discussed a BI professional’s role in the organization and
storage of data. You also investigated
data models and schemas, how BI professionals develop design patterns based on
the organization’s needs, and how databases are designed. You’ve been introduced to data
pipelines, ETL processes, and building BI tools
that help automate moving data from storage systems
to target destinations. You’ve even started
using tools to begin building your
own pipelines. Finally, you learned strategies for gathering information from stakeholders to ensure
that the tools you create for them actually
solve the business problems. But creating systems that
manage and move data is just one part of a
BI professional’s job. You also have to make
sure that those systems continue working for
your stakeholders. Coming up, you’re going
to discover how to maintain your BI tools and
optimized database systems. I hope you’re excited to learn more because there’s a lot more I wanted to share with you. But first, you have
another challenge ahead. Feel free to spend some time
with the glossary and review any of the section’s content before moving on to
your next assessment. Then I’ll be here when you’re
ready to take the next step.
Reading: Glossary terms from module 1
Attribute: In a dimensional model, a characteristic or quality used to describe a dimension
Columnar database: A database organized by columns instead of rows
Combined systems: Database systems that store and analyze data in the same place
Compiled programming language: A programming language that compiles coded instructions that are executed directly by the target machine
Data lake: A database system that stores large amounts of raw data in its original format until it’s needed
Data mart: A subject-oriented database that can be a subset of a larger data warehouse
Data warehouse: A specific type of database that consolidates data from multiple source systems for data consistency, accuracy, and efficient access
Database migration: Moving data from one source platform to another target database
Dimension (data modeling): A piece of information that provides more detail and context regarding a fact
Dimension table: The table where the attributes of the dimensions of a fact are stored
Design pattern: A solution that uses relevant measures and facts to create a model in support of business needs
Dimensional model: A type of relational model that has been optimized to quickly retrieve data from a data warehouse
Distributed database: A collection of data systems distributed across multiple physical locations
Fact: In a dimensional model, a measurement or metric
Fact table: A table that contains measurements or metrics related to a particular event
Foreign key: A field within a database table that is a primary key in another table (Refer to primary key)
Functional programming language: A programming language modeled around functions
Google Dataflow: A serverless data-processing service that reads data from the source, transforms it, and writes it in the destination location
Interpreted programming language: A programming language that uses an interpreter, typically another program, to read and execute coded instructions
Logical data modeling: Representing different tables in the physical data model
Object-oriented programming language: A programming language modeled around data objects
OLAP (Online Analytical Processing) system: A tool that has been optimized for analysis in addition to processing and can analyze data from multiple databases
OLTP (Online Transaction Processing) database: A type of database that has been optimized for data processing instead of analysis
Primary key: An identifier in a database that references a column or a group of columns in which each row uniquely identifies each record in the table (Refer to foreign key)
Python: A general purpose programming language
Response time: The time it takes for a database to complete a user request
Row-based database: A database that is organized by rows
Separated storage and computing systems: Databases where data is stored remotely, and relevant data is stored locally for analysis
Single-homed database: Database where all of the data is stored in the same physical location
Snowflake schema: An extension of a star schema with additional dimensions and, often, subdimensions
Star schema: A schema consisting of one fact table that references any number of dimension tables
Target table: The predetermined location where pipeline data is sent in order to be acted on
Quiz: Module 1 challenge
Which of the following statements correctly describe Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) tools? Select all that apply.
- Online Transaction Processing (OLTP) tools are optimized for data processing instead of analysis.
- Online Analytical Processing (OLAP) tools are optimized for analysis in addition to processing and can analyze data from multiple databases.
Fill in the blank: In order to create an effective data model, business intelligence professionals will often apply a _, which uses relevant measures and facts to create a model that supports business needs.
design pattern
AI assistant
The answer is design pattern.
Business intelligence professionals often apply a design pattern to create an effective data model. A design pattern is a reusable solution to a common problem in software design. In the context of data modeling, design patterns provide a way to create data models that are scalable, extensible, and maintainable.
There are a number of different data modeling design patterns, but one of the most common is the star schema. A star schema is a data model that is optimized for analytical processing. It consists of a central fact table and a number of dimension tables. The fact table contains the measures that will be analyzed, and the dimension tables contain the attributes that will be used to filter and group the data.
Another common data modeling design pattern is the snowflake schema. A snowflake schema is a variation of the star schema that is used to model more complex data relationships. In a snowflake schema, the dimension tables are normalized to reduce data redundancy.
Design patterns can be used to create data models for a variety of different purposes, including:
- Data warehousing
- Business intelligence
- Machine learning
- Data science
By applying a design pattern, business intelligence professionals can create data models that are efficient, scalable, and easy to maintain.
Which of the following statements accurately describe primary keys? Select all that apply.
- Only one primary key can exist in a table.
- A primary key uniquely identifies each record in a table.
- Primary keys can appear as foreign keys in other tables.
AI assistant
The following statements accurately describe primary keys:
- Only one primary key can exist in a table.
- A primary key uniquely identifies each record in a table.
- Primary keys can appear as foreign keys in other tables.
Primary keys are essential for maintaining data integrity in a database. By ensuring that each record has a unique identifier, primary keys help to prevent duplicate data and ensure that data relationships are correctly enforced.
Primary keys can be either a single column or a composite key, which is made up of multiple columns. Composite keys are often used to uniquely identify records in tables that have a one-to-many relationship with other tables.
For example, an Orders
table might have a composite primary key that consists of the OrderID
and CustomerID
columns. This would ensure that each order is uniquely identified, even if two different customers place orders for the same product.
Primary keys can also be used to create foreign keys in other tables. A foreign key is a column in one table that references the primary key of another table. Foreign keys are used to establish relationships between tables and to ensure that data is consistent across the database.
For example, an OrderItems
table might have a foreign key that references the OrderID
primary key of the Orders
table. This would ensure that each order item is associated with a valid order.
Primary keys are an important part of database design and help to ensure the integrity and consistency of data.
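The following minimal sketch illustrates both ideas in code, using SQLite purely because it ships with Python; the Orders and OrderItems tables and their columns are hypothetical examples, not tables from the course. Because Orders has a composite primary key here, the foreign key in OrderItems spans both columns.
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Orders uses a composite primary key made up of OrderID and CustomerID.
conn.execute("""
    CREATE TABLE Orders (
        OrderID    INTEGER,
        CustomerID INTEGER,
        OrderDate  TEXT,
        PRIMARY KEY (OrderID, CustomerID)
    )
""")

# OrderItems references the Orders composite key as a foreign key,
# so every order item must belong to a valid order.
conn.execute("""
    CREATE TABLE OrderItems (
        OrderItemID INTEGER PRIMARY KEY,
        OrderID     INTEGER,
        CustomerID  INTEGER,
        Product     TEXT,
        FOREIGN KEY (OrderID, CustomerID) REFERENCES Orders (OrderID, CustomerID)
    )
""")

conn.execute("INSERT INTO Orders VALUES (1, 42, '2024-01-15')")
conn.execute("INSERT INTO OrderItems VALUES (1, 1, 42, 'Espresso machine')")
conn.commit()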
In a dimensional model, what might dimensions represent? Select all that apply.
Attributes, Supporting details
AI assistant
Dimensions in a dimensional model represent the supporting details of the events or facts being measured. They can be anything that provides context for the facts, such as:
- Date: The date and time of the event
- Product: The product or service that was purchased
- Customer: The customer who made the purchase
- Location: The location where the purchase was made
- Channel: The channel through which the purchase was made (e.g., online, in-store, etc.)
Dimensions are used to filter, group, and analyze the facts in a dimensional model. For example, you could use the Date
dimension to analyze sales by day, month, or year. Or, you could use the Product
dimension to analyze sales by product category or individual product.
Dimensions can also be used to create hierarchies. For example, the Date
dimension could have a hierarchy that includes year, month, and day. This would allow you to drill down from a high-level overview of sales to a more detailed analysis of sales by day.
Dimensions are an essential part of dimensional modeling. They provide the context that is needed to understand and analyze the facts in a data warehouse.
Here are some examples of dimensions in a dimensional model:
- E-commerce: Product, customer, location, order date, channel
- Financial services: Account, customer, transaction type, transaction date, location
- Telecommunications: Customer, phone number, call type, call date, location
Dimensions can be combined to create more complex analyses. For example, you could combine the Product
and Customer
dimensions to analyze sales by product category and customer segment. Or, you could combine the Date
and Location
dimensions to analyze sales trends over time and across different regions.
Dimensions are a powerful tool for data analysis. By understanding the different types of dimensions and how to use them, you can gain valuable insights into your business data.
Fill in the blank: In a dimensional model, a foreign key is used to connect a _ table to the appropriate fact table.
dimension
AI assistant
The answer is dimension.
In a dimensional model, a foreign key is used to connect a dimension table to the appropriate fact table.
A fact table contains the quantitative data that you want to measure, such as sales, revenue, or units sold. Dimension tables contain the qualitative data that provides context for the facts, such as product, customer, date, and location.
Foreign keys are used to link the dimension tables to the fact table. This allows you to filter, group, and analyze the facts by the different dimensions.
For example, the following diagram shows a simple dimensional model for sales data:
Fact table: Sales
Dimension tables:
Product
Customer
Date
Location
Foreign keys:
Sales.ProductKey REFERENCES Product.ProductID
Sales.CustomerKey REFERENCES Customer.CustomerID
Sales.DateKey REFERENCES Date.DateID
Sales.LocationKey REFERENCES Location.LocationID
In this model, the Sales
fact table contains the sales data, such as the product sold, the customer who purchased it, the date of the sale, and the location of the sale. The dimension tables contain the supporting details for these facts, such as the product name, customer name, date, and location name.
The foreign keys are used to link the dimension tables to the fact table. For example, the Sales.ProductKey
column references the Product.ProductID
column. This means that each row in the Sales
table has a corresponding row in the Product
table.
This allows you to filter, group, and analyze the sales data by product. For example, you could use the ProductKey
foreign key to filter the Sales
table to only include sales of a specific product. Or, you could use the ProductKey
foreign key to group the Sales
table by product category and calculate the total sales for each product category.
Foreign keys are an essential part of dimensional modeling. They allow you to link the dimension tables to the fact table and create a comprehensive data model that can be used to analyze your business data from multiple perspectives.
How many fact tables exist in a star schema?
One
AI assistant
There is one fact table in a star schema. The fact table is at the center of the star schema and is surrounded by dimension tables. The dimension tables contain the descriptive attributes of the data, such as product, customer, and date. The fact table contains the quantitative measures, such as sales, revenue, and units sold.
The star schema is a simple yet powerful data modeling technique that is widely used in data warehouses and business intelligence systems. It is easy to understand and implement, and it provides a good foundation for data analysis.
Here is an example of a star schema for sales data:
Fact table: Sales
Dimension tables:
Product
Customer
Date
Location
Foreign keys:
Sales.ProductKey REFERENCES Product.ProductID
Sales.CustomerKey REFERENCES Customer.CustomerID
Sales.DateKey REFERENCES Date.DateID
Sales.LocationKey REFERENCES Location.LocationID
In this example, the Sales
fact table contains the sales data, such as the product sold, the customer who purchased it, the date of the sale, and the location of the sale. The dimension tables contain the supporting details for these facts, such as the product name, customer name, date, and location name.
The foreign keys are used to link the dimension tables to the fact table. This allows you to filter, group, and analyze the sales data by the different dimensions. For example, you could use the ProductKey
foreign key to filter the Sales
table to only include sales of a specific product. Or, you could use the ProductKey
foreign key to group the Sales
table by product category and calculate the total sales for each product category.
The star schema is a valuable tool for data analysis. It allows you to create a comprehensive data model that can be used to analyze your business data from multiple perspectives.
A business intelligence team wants to improve the state of their database schemas. While working toward this goal, they move data from one source platform to another target database. What process does this situation describe?
Database migration
AI assistant
The process that the business intelligence team is performing is called database migration. Database migration is the process of moving data from one source database to another target database. This can be done for a variety of reasons, such as to improve performance, security, or scalability.
Database migration can be a complex process, and it is important to carefully plan and execute the migration to avoid any data loss or corruption. There are a number of different database migration tools and services available, and the best approach to use will depend on the specific needs of the organization.
Here are some of the benefits of database migration:
- Improved performance: Newer database technologies may offer better performance than older technologies. Migrating to a newer database platform can improve the speed and responsiveness of your applications.
- Increased security: Newer database technologies may also offer better security features than older technologies. Migrating to a newer database platform can help to protect your data from unauthorized access and attack.
- Improved scalability: Newer database technologies may be able to scale more easily to meet the needs of growing businesses. Migrating to a newer database platform can help to ensure that your database can handle increased traffic and data volumes.
- Reduced costs: Migrating to a cloud-based database platform can help to reduce the costs of maintaining and operating your database infrastructure.
If you are considering migrating your database, it is important to work with a qualified database administrator to ensure that the migration is successful.
In row-based databases, each row in a table is an instance or an entry in the database. How are details about that instance recorded and organized?
By column
AI assistant
Details about an instance in a row-based database are recorded and organized by column. Each column records one detail, or attribute, of that instance, and all of the values for a single row are stored together as one record.
For example, consider a database table with the following columns:
Column | Type | Description |
---|---|---|
ProductID | Integer | The unique identifier of the product |
ProductName | String | The name of the product |
Price | Decimal | The price of the product |
In a row-based database, each product’s ProductID, ProductName, and Price are stored together as a single row. The columns define which detail each value represents, which makes it easy to retrieve the complete record for any given product.
Row-based databases are the most common type of database, and they are used for a wide variety of applications. They are efficient for storing and retrieving large amounts of data, and they are easy to use and manage.
Here is an example of how data is organized in a row-based database:
ProductID | ProductName | Price
------- | ----------- | --------
1 | iPhone 13 | 999
2 | MacBook Air | 1299
3 | iPad Pro | 799
Each row in the table represents an instance of a product, and each column records one detail about that instance, such as its ID, name, or price. All of the values for a single row, for example the iPhone 13’s ProductID, ProductName, and Price, are stored together as one record.
Because the details are organized by column, you can still work with a single attribute across all instances. For example, to retrieve all of the ProductNames, you would query the ProductName column.
Row-based databases are an efficient and easy-to-use way to store and retrieve large amounts of data. They are the most common type of database, and they are used for a wide variety of applications.
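To illustrate the difference in how the two layouts organize the same records, here is a minimal sketch using plain Python data structures; it is only an analogy, not actual database internals.
# Row-based layout: each record (instance) is kept together,
# and its details are organized by column position.
row_based = [
    (1, "iPhone 13", 999),      # ProductID, ProductName, Price
    (2, "MacBook Air", 1299),
    (3, "iPad Pro", 799),
]

# Columnar layout: all values for one column are kept together instead.
columnar = {
    "ProductID": [1, 2, 3],
    "ProductName": ["iPhone 13", "MacBook Air", "iPad Pro"],
    "Price": [999, 1299, 799],
}

# Reading one whole record is natural in the row-based layout...
print(row_based[0])

# ...while scanning a single column, such as every price, is natural in the columnar layout.
print(columnar["Price"])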
A business intelligence team is working with a database system in which relevant data is stored locally and less relevant data is stored remotely. What type of database system are they using?
Separated storage and computing system
AI assistant
The type of database system that the business intelligence team is working with is a separated storage and computing system. In this kind of system, data is stored remotely, and the data that is most relevant for analysis is kept locally so that it can be queried and computed on efficiently.
Keeping storage and computing separate has several benefits:
- Improved performance: Analysts can run queries against the locally stored, relevant data without scanning everything the organization keeps in remote storage.
- Independent scaling: Storage capacity and computing power can be scaled separately, so an organization only adds the resources it actually needs.
- Lower costs: Less relevant or infrequently used data can remain in lower-cost remote storage until it is needed.
This is different from a combined system, which stores and analyzes data in the same place, and from a distributed database, which is a collection of data systems spread across multiple physical locations.
Fill in the blank: A database schema must describe _ because this is necessary when users want to understand how the data is shaped and the relationships within the database.
relevant data
AI assistant
The answer is relevant data. A database schema must describe the relevant data because this is necessary when users want to understand how the data is shaped and the relationships within the database.
A database schema is a blueprint of a database. It describes the logical and physical structure of the database, including the tables, columns, data types, and relationships between tables.
The database schema is important for a number of reasons. First, it helps to ensure the integrity of the data. By defining the data types and relationships between tables, the database schema helps to prevent invalid data from being entered into the database.
Second, the database schema makes it easier for users to understand and use the database. By providing a detailed description of the data, the database schema helps users to understand how the data is organized and how to access it.
Finally, the database schema is essential for database administration tasks. For example, the database schema is used to create and maintain database objects, such as tables, indexes, and views.
Here are some of the benefits of having a well-defined database schema:
- Improved data quality: Defined data types and table relationships prevent invalid data from being entered into the database.
- Increased data accessibility: A clear description of the data helps users understand how it is organized and how to access it.
- Reduced database administration costs: Administrators rely on the schema to create and maintain database objects such as tables, indexes, and views, which saves time and effort.
Overall, a well-defined database schema is essential for ensuring the integrity, accessibility, and manageability of data.
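To illustrate how a schema spells out tables, columns, data types, and relationships, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their columns are hypothetical examples, not part of the course materials.
Python
import sqlite3

# An in-memory database is enough to demonstrate schema definition.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        region      TEXT
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_total REAL,
        -- The foreign key is the relationship the schema makes explicit.
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    );
""")

# The schema itself can be inspected, which is how users learn how the data is shaped.
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE sql IS NOT NULL"):
    print(f"{name}:\n{sql}\n")
Running the script prints the CREATE statements back out, which is essentially what a schema is: a machine-readable description of how the data is shaped and how the tables relate.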
In the ETL loading stage, what are typical target destinations to which the data might be delivered? Select all that apply.
Data warehouse, Analytics platform
AI assistant
Typical target destinations for the data loading stage of ETL include:
- Data lake
- Data warehouse
- Website application
- Analytics platform
Data lakes are a type of data storage that can store large amounts of unstructured and semi-structured data. Data warehouses are a type of data storage that is optimized for analytical queries. Website applications can be used to display data to users. Analytics platforms can be used to analyze data and generate insights.
Here is a brief description of each target destination:
- Data lake: A data lake is a centralized repository that stores all of an organization’s data in its native format. This includes structured, semi-structured, and unstructured data. Data lakes are typically used for big data analytics and machine learning.
- Data warehouse: A data warehouse is a system that stores data in a structured format that is optimized for analytical queries. Data warehouses are typically used for business intelligence and reporting purposes.
- Website application: A website application is a web-based application that can be used to display data to users. Website applications are often used to create dashboards and reports that can be accessed by users from any location.
- Analytics platform: An analytics platform is a software application that can be used to analyze data and generate insights. Analytics platforms typically provide a variety of features, such as data visualization, data mining, and machine learning.
The target destination chosen for the loading stage depends on the organization's needs: a data lake for storing large amounts of unstructured data, a data warehouse for complex analytical queries, a website application for presenting data to users, and an analytics platform for analyzing data and generating insights.
It is also possible to load data to multiple target destinations. For example, an organization may load data to a data lake for long-term storage and to a data warehouse for analytical queries.
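As a rough illustration of loading to more than one target, here is a minimal Python sketch that delivers the same transformed rows to a CSV file (standing in for a data lake landing zone) and to an in-memory SQLite table (standing in for a warehouse table). The file name, table name, and sample rows are hypothetical.
Python
import csv
import sqlite3

# Rows as they might look after the transform stage.
rows = [
    {"product": "desk lamp", "units_sold": 12, "revenue": 240.0},
    {"product": "bookshelf", "units_sold": 3, "revenue": 450.0},
]

# Target 1: a file-based landing zone, loosely analogous to a data lake.
with open("daily_sales.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "units_sold", "revenue"])
    writer.writeheader()
    writer.writerows(rows)

# Target 2: a structured table optimized for queries, loosely analogous to a warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_sales (product TEXT, units_sold INTEGER, revenue REAL)"
)
warehouse.executemany(
    "INSERT INTO daily_sales VALUES (:product, :units_sold, :revenue)", rows
)
print(warehouse.execute("SELECT SUM(revenue) FROM daily_sales").fetchone())
Loading the same data to two places like this is common in practice: the lake keeps an inexpensive raw copy, while the warehouse copy is shaped for reporting.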
Fill in the blank: Python is a programming language that is _, which means it’s modeled around chunks of code that capture certain information.
object-oriented
AI assistant
The answer is object-oriented.
Python is an object-oriented programming language, which means that it’s modeled around chunks of code that capture certain information, or objects. Objects can represent anything from a person to a product to a mathematical function.
Object-oriented programming has a number of benefits, including:
- Modularity: Objects can be reused and combined in different ways to create new programs. This makes it easier to develop and maintain complex programs.
- Data encapsulation: Objects can hide their internal data from other objects, which helps to protect the data and prevent errors.
- Polymorphism: Objects of different types can respond to the same message in different ways. This makes it possible to write code that is more flexible and reusable.
Python is a popular choice for object-oriented programming because it is easy to learn and use. It also has a number of features that make it well-suited for object-oriented development, such as its support for multiple inheritance and its dynamic typing system.
Here is an example of a simple object-oriented program in Python:
Python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        print(f"Hello, my name is {self.name} and I am {self.age} years old.")

# Create a new Person object
bob = Person("Bob", 40)

# Call the greet() method on the bob object
bob.greet()
This program defines a class called Person with two attributes, name and age. It also defines a method called greet(), which prints a greeting to the console. To create a new Person object, we use the Person() constructor. We can then call the greet() method on the object to print a greeting to the console.
Object-oriented programming is a powerful paradigm for building complex, efficient programs, and Python's approachable syntax makes it a natural language for putting it into practice.