You’ll learn more about database systems, including data marts, data lakes, data warehouses, and ETL processes. You’ll also investigate the five factors of database performance: workload, throughput, resources, optimization, and contention. Finally, you’ll consider how to design efficient queries that get the most from a system.
Learning Objectives
- Discover strategies for creating an ETL process that meets organizational and stakeholder needs, and for maintaining that process efficiently.
- Understand what the different data storage and extraction processes and tools may include (Extract/Load: Stitch/Segment/Fivetran; Transform: DBT/Airflow/Looker).
- Explain how to optimize performance when building new tables.
- Identify and describe where new tables can fit in the pipeline.
- Recognize the different aspects of databases, including OLAP and OLTP, columnar and relational, distributed and single-homed databases.
- Understand the importance of database performance and optimization.
- Describe the five factors of database performance: workload, throughput, resources, optimization, and contention.
- Perform pipeline debugging using queries.
- Database performance
- Video: Welcome to module 2
- Video: Data marts, data lakes, and the ETL process
- Reading: ETL versus ELT
- Video: The five factors of database performance
- Reading: A guide to the five factors of database performance
- Video: Optimize database performance
- Reading: Indexes, partitions, and other ways to optimize
- Practice Quiz: Activity: Partition data and create indexes in BigQuery
- Reading: Activity Exemplar: Partition data and create indexes in BigQuery
- Reading: Case study: Deloitte – Optimizing outdated database systems
- Video: The five factors in action
- Reading: Determine the most efficient query
- Upgraded Plugin: Design: Optimize for database speed
- Practice Quiz: Test your knowledge: Database performance
- Review: Dynamic database design
Database performance
Video: Welcome to module 2
This video series will explore how to improve the performance of data pipelines by understanding and optimizing database queries. This will allow you to deliver the most up-to-date information to your stakeholders more efficiently.
The series will cover the following topics:
- How to increase throughput and minimize resource contention
- Database systems, including data marts, data lakes, data warehouses, and ELT processes
- The five factors of database performance: workload, throughput, resources, optimization, and contention
- Tips for improving database intake and storage
- How to design efficient queries
By the end of the series, you will be able to:
- Optimize database queries to improve the performance of data pipelines
- Choose the right database system for your needs
- Understand the factors that affect database performance
- Improve database intake and storage
- Design efficient queries that get the most out of your systems
In order to efficiently deliver the most up-to-date information to your stakeholders, you must first understand and optimize query performance within your pipelines. And that’s what we’re going to explore in the next few videos. We’ll discover how to increase throughput and minimize the competition for resources within the system to enable the largest possible workload to be processed. We’ll get into database systems, including data marts, data lakes, data warehouses, and ELT processes. That’s not a typo: ELT is different from ETL. It stands for extract, load, and transform. You’ll also witness how these systems contribute to the overall efficiency of your data systems. In addition, you’ll investigate the five factors of database performance: workload, throughput, resources, optimization, and contention. And you’ll gain some tips for making sure your database intake and storage are the best they can be. Finally, we’ll begin thinking about how to design efficient queries that really get the most out of your systems. Let’s do it.
Video: Data marts, data lakes, and the ETL process
In addition to data warehouses, there are other data storage and processing patterns that BI professionals may encounter, such as data marts, data lakes, and ELT processes.
Data marts: Data marts are subject-oriented databases that can be a subset of a larger data warehouse. They are useful for accessing the relevant data that needs to be pulled for a particular project.
Data lakes: Data lakes are database systems that store large amounts of raw data in its original format until it’s needed. This makes the data easily accessible, because it doesn’t require a lot of processing.
ELT processes: ELT stands for Extract, Load, and Transform. It is a type of data pipeline that enables data to be gathered from different sources, usually data lakes, then loaded into a unified destination system and transformed into a useful format.
These new technologies and processes offer a number of advantages over traditional data warehouses, such as:
- Increased flexibility and scalability
- Reduced storage costs
- Faster data processing
- Ability to handle a wider variety of data types
BI professionals who are curious and lifelong learners will be well-positioned to take advantage of these new technologies and processes to deliver better insights to their stakeholders.
Data marts, data lakes, and the ETL process
Data marts
A data mart is a subset of a data warehouse that is focused on a specific business area or department. For example, a company might have a data mart for its sales team, its marketing team, and its finance team.
Data marts are typically smaller and more focused than data warehouses, which makes them faster and easier to query. They are also often more affordable to build and maintain.
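As a hedged illustration (the dataset and table names here are hypothetical, not from the course), a data mart can be as simple as a focused table derived from the larger warehouse:

```sql
-- Hypothetical sketch: build a sales-focused data mart from the warehouse,
-- copying only the columns and rows the sales team actually needs.
CREATE TABLE `sales_mart.orders_summary` AS (
  SELECT
    order_id,
    order_date,
    region,
    amount
  FROM `warehouse.orders`
  WHERE order_date >= '2023-01-01'  -- keep only the period of interest
);
```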
Data lakes
A data lake is a central repository that stores all of a company’s data, regardless of its format or structure. Data lakes can store structured data, such as relational database tables, as well as unstructured data, such as images, videos, and text files.
Data lakes are often used to store raw data that has not yet been processed or analyzed. This data can then be analyzed using a variety of tools and techniques, such as machine learning and artificial intelligence.
The ETL process
The ETL process is a three-step process for extracting data from source systems, transforming it into a consistent format, and loading it into a target system.
- Extract: The first step is to extract the data from the source systems. This may involve connecting to the source systems and querying the data, or it may involve exporting the data from the source systems into files.
- Transform: The second step is to transform the data into a consistent format. This may involve cleaning the data, removing errors, and converting the data to a common data model.
- Load: The third step is to load the data into the target system. This may involve connecting to the target system and inserting the data into tables, or it may involve importing the data into the target system from files.
The ETL process is important for ensuring that data is accurate, consistent, and accessible for analysis.
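To make the transform step concrete, here is a minimal SQL sketch, assuming hypothetical staging tables and columns, of cleaning extracted data into a consistent format before it is loaded:

```sql
-- Hypothetical transform step: standardize and clean raw extracted records.
CREATE TABLE `staging.customers_clean` AS (
  SELECT
    customer_id,
    TRIM(LOWER(email)) AS email,              -- normalize formatting
    COALESCE(country, 'unknown') AS country   -- fill missing values consistently
  FROM `staging.customers_raw`
  WHERE customer_id IS NOT NULL               -- drop records missing their key
);
```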
Conclusion
Data marts, data lakes, and the ETL process are all important components of modern data warehouses. Data marts provide fast and easy access to data for specific business areas or departments. Data lakes provide a central repository for all of a company’s data, regardless of its format or structure. And the ETL process ensures that data is accurate, consistent, and accessible for analysis.
When to use data marts
Data marts are a good choice for organizations that:
- Need to provide fast and easy access to data for specific business areas or departments.
- Have limited resources to build and maintain a data warehouse.
- Need to comply with data privacy or security regulations.
When to use data lakes
Data lakes are a good choice for organizations that:
- Need to store a large volume of data, including unstructured data.
- Need to perform complex analytics on their data.
- Need to be able to scale their data storage and processing capabilities quickly and easily.
When to use the ETL process
The ETL process should be used by any organization that needs to extract data from multiple source systems, transform it into a consistent format, and load it into a target system. This includes organizations that are using data marts, data lakes, or traditional data warehouses.
Choosing the right technology
The best technology for your organization will depend on your specific needs and requirements. If you are not sure which technology is right for you, it is a good idea to consult with a data expert.
Fill in the blank: A data lake is a database system that stores large amounts of _____ in its original format until it’s needed.
raw data
A data lake is a database system that stores large amounts of raw data in its original format until it’s needed. While the raw data has been tagged to be identifiable, it is not organized.
What is the term for a pipeline that extracts, loads, then transforms the data?
ELT
ELT is a pipeline that extracts, loads, then transforms the data. It enables data to be gathered from data lakes, loaded into a unified destination system, and transformed into a useful format.
One of the amazing things about BI is that the tools and processes are constantly evolving, which means BI professionals always have new opportunities to build and improve current systems. So, let’s learn about some other interesting data storage and processing patterns you might encounter as a BI professional.

Throughout these courses, we’ve learned about database systems that make use of data warehouses for their storage needs. As a refresher, a data warehouse is a specific type of database that consolidates data from multiple source systems for data consistency, accuracy, and efficient access. Basically, a data warehouse is a huge collection of data from all the company’s systems. Data warehouses were really common when companies used a single machine to store and compute their relational databases. However, with the rise of cloud technologies and the explosion of data volume, new patterns for data storage and computation emerged.

One of these tools is a data mart. As you may recall, a data mart is a subject-oriented database that can be a subset of a larger data warehouse. In BI, subject-oriented describes something that is associated with specific areas or departments of a business, such as finance, sales, or marketing. As you’re learning, BI projects commonly focus on answering various questions for different teams. So a data mart is a convenient way to access the relevant data that needs to be pulled for a particular project.

Now, let’s check out data lakes. A data lake is a database system that stores large amounts of raw data in its original format until it’s needed. This makes the data easily accessible, because it doesn’t require a lot of processing. Like a data warehouse, a data lake combines many different sources, but data warehouses are hierarchical, with files and folders to organize the data, whereas data lakes are flat. And while the data has been tagged so it is identifiable, it’s not organized; it’s fluid, which is why it’s called a data lake. Data lakes don’t require the data to be transformed before storage, so they are useful if your BI system is ingesting a lot of different data types. But of course, the data eventually needs to get organized and transformed.

One way to integrate data lakes into a data system is through ELT. Previously, we learned about the ETL process, where data is extracted from the source into the pipeline, transformed while it is being transported, and then loaded into its destination. ELT takes the same steps but reorganizes them so that the pipeline extracts, loads, and then transforms the data. Basically, ELT is a type of data pipeline that enables data to be gathered from different sources, usually data lakes, then loaded into a unified destination system and transformed into a useful format. ELT enables BI professionals to ingest many different kinds of data into a storage system as soon as that data is available, and they only have to transform the data they need. ELT also reduces storage costs and enables businesses to scale storage and computation resources independently.

As technology advances, the processes and tools available also advance, and that’s great. Some of the most successful BI professionals do well because they are curious, lifelong learners.
Reading: ETL versus ELT
Reading
So far in this course, you have learned about ETL pipelines that extract, transform, and load data between database storage systems. You have also started learning about newer pipeline systems like ELT pipelines that extract, load, and then transform data. In this reading, you are going to learn more about the differences between these two systems and the ways different types of database storage fit into those systems. Understanding these differences will help you make key decisions that promote performance and optimization to ensure that your organization’s systems are efficient and effective.
The primary difference between these two pipeline systems is the order in which they transform and load data. There are also some other key differences in how they are constructed and used (a short ELT sketch follows the table):
Differences | ETL | ELT |
---|---|---|
The order of extraction, transformation, and loading data | Data is extracted, transformed in a staging area, and loaded into the target system | Data is extracted, loaded into the target system, and transformed as needed for analysis |
Location of transformations | Data is moved to a staging area where it is transformed before delivery | Data is transformed in the destination system, so no staging area is required |
Age of the technology | ETL has been used for over 20 years, and many tools have been developed to support ETL pipeline systems | ELT is a newer technology with fewer support tools built-in to existing technology |
Access to data within the system | ETL systems only transform and load the data designated when the warehouse and pipeline are constructed | ELT systems load all of the data, allowing users to choose which data to analyze at any time |
Calculations | Calculations executed in an ETL system replace or revise existing columns in order to push the results to the target table | Calculations are added directly to the existing dataset |
Compatible storage systems | ETL systems are typically integrated with structured, relational data warehouses | ELT systems can ingest unstructured data from sources like data lakes |
Security and compliance | Sensitive information can be redacted or anonymized before loading it into the data warehouse, which protects data | Data has to be uploaded before data can be anonymized, making it more vulnerable |
Data size | ETL is great for dealing with smaller datasets that need to undergo complex transformations | ELT is well-suited to systems using large amounts of both structured and unstructured data |
Wait times | ETL systems have longer load times, but analysis is faster because data has already been transformed when users access it | Data loading is very fast in ELT systems because data can be ingested without waiting for transformations to occur, but analysis is slower |
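To make the ordering difference concrete, here is a hedged sketch of the ELT pattern (all table and column names are hypothetical): raw data is loaded into the destination first, and transformation happens later, inside the warehouse, only for the data that is actually needed.

```sql
-- Step 1 (extract and load): raw event records are ingested into the
-- warehouse as-is, for example into raw.raw_events. No staging area needed.

-- Step 2 (transform, on demand): shape only the slice of data you need,
-- directly in the destination system.
CREATE TABLE `analytics.daily_signups` AS (
  SELECT
    DATE(event_timestamp) AS signup_date,
    COUNT(*) AS signups
  FROM `raw.raw_events`
  WHERE event_type = 'signup'
  GROUP BY signup_date
);
```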
Data storage systems
Because ETL and ELT systems deal with data in slightly different ways, they are optimized to work with different data storage systems. Specifically, you might encounter data warehouses and data lakes. As a refresher, a data warehouse is a type of database that consolidates data from multiple source systems for data consistency, accuracy, and efficient access. And a data lake is a database system that stores large amounts of raw data in its original format until it’s needed. While these two systems perform the same basic function, there are some key differences:
Data warehouse | Data lake |
---|---|
Data has already been processed and stored in a relational system | Data is raw and unprocessed until it is needed for analysis; additionally, it can have a copy of the entire OLTP or relational database |
The data’s purpose has already been assigned, and the data is currently in use | The data’s purpose has not been determined yet |
Making changes to the system can be complicated and require a lot of work | Systems are highly accessible and easy to update |
There is also a specific type of data warehouse you might use as a data source: data marts. Data marts are very similar to data warehouses in how they are designed, except that they are much smaller. Usually, a data mart is a single subset of a data warehouse that covers data about a single subject.
Key takeaways
Currently, ETL systems that extract, transform and load data, and ELT systems that extract, load, and then transform data are common ways that pipeline systems are constructed to move data where it needs to go. Understanding the differences between these systems can help you recognize when you might want to implement one or the other. And, as business and technology change, there will be a lot of opportunities to engineer new solutions using these data systems to solve business problems.
Video: The five factors of database performance
Database performance is a measure of the workload that can be processed by a database, as well as the associated costs. The factors that influence database performance are:
- Workload: The combination of transactions, queries, analysis, and system commands being processed by the database system at any given time.
- Throughput: The overall capability of the database’s hardware and software to process requests.
- Resources: The hardware and software tools available for use in a database system, such as disk space and memory.
- Optimization: Maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance.
- Contention: When two or more components attempt to use a single resource in a conflicting way.
Understanding these factors can help you to improve the performance of your database system.
Tutorial on “The five factors of database performance” in Business Intelligence
Database performance is a critical factor in any business intelligence (BI) system. When a database is performing well, users can get the information they need quickly and easily. This can lead to better decision-making and improved business outcomes.
There are five factors that influence database performance:
- Workload: The workload of a database is the combination of transactions, queries, analysis, and system commands that are being processed at any given time. The workload can fluctuate depending on the time of day, week, or month. For example, the workload may be higher at the end of the month when reports are being run.
- Throughput: Throughput is the rate at which a database can process requests. It is measured in transactions per second (TPS). Throughput is affected by the hardware and software of the database system, as well as the workload.
- Resources: Resources are the hardware and software tools that are available to the database system. This includes the CPU, memory, disk space, and network bandwidth. Resources can affect throughput and performance.
- Optimization: Optimization is the process of tuning the database system to improve performance. This can include things like creating indexes, partitioning tables, and using caching.
- Contention: Contention occurs when two or more components of the database system are trying to access the same resource at the same time. This can lead to performance degradation (illustrated in the sketch after this list).
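As a hedged illustration of contention, assuming a relational database with row-level locking (PostgreSQL-style syntax; the table and values are hypothetical), two sessions updating the same row must take turns:

```sql
-- Session A:
BEGIN;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 42;
-- The row is now locked until Session A commits or rolls back.

-- Session B, running at the same time, is now in contention with Session A:
UPDATE inventory SET quantity = quantity - 5 WHERE product_id = 42;
-- Session B blocks here waiting for the lock, which reduces throughput.

-- Back in Session A:
COMMIT;  -- releases the lock so Session B can proceed
```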
How to improve database performance
There are a number of things that you can do to improve database performance. Here are a few tips:
- Understand the workload: The first step to improving database performance is to understand the workload. This includes identifying the most common types of queries and transactions that are being processed. Once you understand the workload, you can start to optimize the database system for those specific tasks.
- Tune the database: Tuning the database is another important way to improve performance. This involves adjusting the configuration of the database system to improve performance. For example, you may need to create indexes or partition tables.
- Monitor the database: It is important to monitor the database system on a regular basis to identify any potential performance problems. This can be done using a variety of tools and techniques (see the monitoring sketch after this list).
- Upgrade the hardware: If the database system is overloaded, you may need to upgrade the hardware. This can include adding more CPU, memory, or disk space.
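As one hedged example of monitoring, this sketch assumes BigQuery, where recent job metadata is exposed through INFORMATION_SCHEMA; the region qualifier and available columns may differ in your environment:

```sql
-- Find the ten longest-running queries from the past week to see where the
-- workload is concentrated.
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY runtime_seconds DESC
LIMIT 10;
```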
Conclusion
By understanding the five factors that influence database performance and taking steps to improve performance, you can ensure that your BI system is able to meet the needs of your users.
A database is performing slowly because multiple components are attempting to use the same piece of data at the same time. Which of the factors of database performance should be addressed?
Contention
The factor of contention should be addressed. Contention occurs when two or more components attempt to use a single resource in a conflicting way.
We’ve been investigating database optimization and why it’s important to make sure that users are able to get what they need from the system as efficiently as possible. Successful optimization can be measured by database performance. Database performance is a measure of the workload that can be processed by a database, as well as the associated costs. In this video, we’re going to consider the factors that influence database performance: workload, throughput, resources, optimization, and contention.

First, we’ll start with workload. In BI, workload refers to the combination of transactions, queries, analysis, and system commands being processed by the database system at any given time. It’s common for a database’s workload to fluctuate drastically from day to day, depending on what jobs are being processed and how many users are interacting with the database. The good news is that you can often predict these fluctuations. For instance, there might be a higher workload at the end of the month when reports are being processed, or the workload might be really light right before a holiday.

Next, we have throughput. Throughput is the overall capability of the database’s hardware and software to process requests. Throughput is made up of the input and output speed, the central processing unit speed, how well the machine can run parallel processes, the database management system, and the operating system and system software. Basically, throughput describes the workload size that the system can handle.

Let’s get into resources. In BI, resources are the hardware and software tools available for use in a database system. This includes the disk space and memory. Resources are a big part of a database system’s ability to process requests and handle data. They can also fluctuate, especially if the hardware or other dedicated resources are shared with additional databases, software applications, or services. Also, cloud-based systems are particularly prone to fluctuation. It’s useful to remember that external factors can affect performance.

Now we come to optimization. Optimization involves maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance. This is one of the most important factors that BI professionals return to again and again. Coming up soon, we’re going to talk about it in more detail.

Finally, the last factor of database performance is contention. Contention occurs when two or more components attempt to use a single resource in a conflicting way. This can really slow things down. For instance, if there are multiple processes trying to update the same piece of data, those processes are in contention. As contention increases, the throughput of the database decreases. Limiting contention as much as possible will help ensure the database is performing at its best.

There you have the five factors of database performance: workload, throughput, resources, optimization, and contention. Coming up, we’re going to check out an example of these factors in action so you can understand more about how each contributes to database performance.
Reading: A guide to the five factors of database performance
Reading
Database performance is an important consideration for BI professionals. As you have been learning, database performance is a measure of the workload that can be processed by the database, as well as associated costs. Optimization involves maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance. This means that your stakeholders have the fastest access to the data they need to make quick and intelligent decisions. You have also been learning that there are five factors of database performance: workload, throughput, resources, optimization, and contention.
The five factors
In this reading, you will be given a quick overview of the five factors that you can reference at any time and an example to help outline these concepts. In the example, you are a BI professional working with the sales team to gain insights about customer purchasing habits and monitor the success of current marketing campaigns.
Factor | Definition | Example |
---|---|---|
Workload | The combination of transactions, queries, data warehousing analysis, and system commands being processed by the database system at any given time. | On a daily basis, your database needs to process sales reports, perform revenue calculations, and respond to real-time requests from stakeholders. All of these needs represent the workload the database needs to be able to handle. |
Throughput | The overall capability of the database’s hardware and software to process requests. | The system’s throughput is the combination of input and output speed, the CPU speed, the machine’s ability to run parallel processes, the database management system, and the operating system and system software. |
Resources | The hardware and software tools available for use in a database system. | The database system is primarily cloud-based, which means it depends on online resources and software to maintain functionality. |
Optimization | Maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance. | Continually checking that the database is running optimally is part of your job as the team’s BI professional. |
Contention | When two or more components attempt to use a single resource in a conflicting way. | Because this system automatically generates reports and responds to user requests, there are times when it may be trying to run queries on the same datasets at the same time, causing slowdowns for users. |
Video: Optimize database performance
Database optimization is the process of maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance.
BI professionals optimize databases by:
- Examining resource use to identify inefficient queries, missing or inconsistent indexes and partitions, data fragmentation, and memory or CPU constraints.
- Rewriting inefficient queries, creating new indexes, partitioning data appropriately, and defragmenting data.
- Ensuring that the database has the capacity to handle the organization’s demands.
By addressing these issues, BI professionals can improve database performance and make it easier for users to access the data they need.
Additional notes:
- Query plans can be used to identify steps in a query that are causing performance problems (see the sketch after this list).
- Data partitioning is a common practice in cloud-based systems working with big data.
- Fragmented data can occur when data is broken up into many pieces that are not stored together.
- It is important to monitor database performance to ensure that the database is able to meet the needs of the organization.
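As a hedged sketch of reading a query plan, the following assumes a PostgreSQL-style database that supports EXPLAIN ANALYZE (BigQuery surfaces the same information in its Execution details tab instead); all names are hypothetical:

```sql
-- Ask the database how it intends to execute the query, and time each step.
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount) AS total_spent
FROM sales
WHERE sale_date >= '2023-01-01'
GROUP BY customer_id;

-- If the plan reports a full sequential scan on sales, an index on the
-- filtered column may remove the bottleneck:
CREATE INDEX idx_sales_sale_date ON sales (sale_date);
```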
Here are some tips for optimizing database performance in BI:
- Use efficient queries: When writing queries, try to use the most efficient methods possible. This includes using indexes, avoiding unnecessary subqueries, and using the appropriate data types.
- Partition data: Partitioning data can improve performance by dividing the data into smaller, more manageable chunks. This can be especially helpful for large datasets.
- Use caching: Caching can improve performance by storing frequently accessed data in memory. This can reduce the number of times the database has to be accessed.
- Use a database management system (DBMS) that is designed for BI: Some DBMSs are specifically designed for BI workloads. These DBMSs typically have features that can improve performance, such as columnar storage and in-memory analytics.
By following these tips, you can improve database performance and make it easier for BI users to access the information they need.
Here is an example of how to optimize database performance in BI:
Suppose you have a BI system that is used to generate sales reports. The reports are based on a large dataset of sales transactions. The queries that are used to generate the reports are slow, and users are complaining that it takes too long to get the reports they need.
To optimize database performance, you can start by analyzing the workload. Identify the most common types of queries that are being used to generate the reports. Once you have identified the most common queries, you can start to optimize them.
For example, you may need to create indexes on the tables that are used in the most common queries. You may also need to partition the data so that the queries can run more efficiently.
In addition to optimizing the queries, you can also improve performance by tuning the database. For example, you may need to adjust the memory allocation for the database or increase the number of worker threads.
Finally, you should monitor the database performance on a regular basis to identify any potential problems. You can use a variety of tools to monitor the database, such as the database’s own performance monitoring tools or third-party monitoring tools.
By following these steps, you can optimize database performance and improve the performance of your BI system.
What is the process of dividing a database into distinct, logical parts in order to improve query processing and increase manageability?
Data partitioning
Data partitioning is the process of dividing a database into distinct, logical parts in order to improve query processing and increase manageability. Ensuring data is partitioned appropriately is a key part of database performance optimization.
Recently, we’ve been learning a lot about database performance. As a refresher, this is a measure of the workload that can be processed by the database, as well as associated costs. We also explored optimization, which is one of the most important factors of database performance. You’ll recall that optimization involves maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance. In this video, we’re going to focus on optimization and how BI professionals optimize databases by examining resource use and identifying better data sources and structures. Again, the goal is to enable the system to process the largest possible workload at the most reasonable cost. This requires a speedy response time, which is how long it takes for a database to respond to a user request.

Here’s an example. Imagine you’re a BI professional receiving emails from people on your team who say that it’s taking longer than usual for them to pull the data they need from the database. At first, this seems like a pretty minor inconvenience, but a slow database can be disruptive and cost your team a lot of time. If they have to stop and wait whenever they need to pull data or perform a calculation, it really affects their work.

There are a few reasons that users might be encountering this issue. Maybe the queries aren’t fully optimized, or the database isn’t properly indexed or partitioned. Perhaps the data is fragmented, or there isn’t enough memory or CPU. Let’s examine each of these.

First, if the queries users are writing to interact with the database are inefficient, they can actually slow down your database resources. To avoid this, the first step is to simply revisit the queries to ensure they’re as efficient as possible. The next step is to consider the query plan. In a relational database system that uses SQL, a query plan is a description of the steps the database system takes in order to execute a query. As you’ve learned, a query tells a system what to do, but not necessarily how to do it. The query plan is the how. If queries are running slowly, checking the query plan to find out if there are steps causing more draw than necessary can be helpful. This is another iterative process. After checking the query plan, you might rewrite the query or create new tables and then check the query plan again.

Now let’s consider indexing. An index is an organizational tag used to quickly locate data within a database system. If the tables within a database haven’t been fully indexed, it can take the database longer to locate resources. In cloud-based systems working with big data, you might have data partitions instead of indexes. Data partitioning is the process of dividing a database into distinct, logical parts in order to improve query processing and increase manageability. The distribution of data within the system is extremely important. Ensuring that data has been partitioned appropriately and consistently is part of optimization too.

The next issue is fragmented data. Fragmented data occurs when data is broken up into many pieces that are not stored together, often as a result of using the data frequently or creating, deleting, or modifying files. For example, if you are accessing the same data often and versions of it are being saved in your cache, those versions are actually causing fragmentation in your system.

Finally, if your database is having trouble keeping up with your organization’s demands, it might mean there isn’t enough memory available to process everyone’s requests. Making sure your database has the capacity to handle everything you ask of it is critical.

Consider our example again. You received some emails from the team stating that it was taking longer than usual to access data from the database. After learning about the slowdown from your team, you were able to assess the situation and make some fixes. Addressing the issues allowed you to ensure the database was working as efficiently as possible for your team. Problem solved. But database optimization is an ongoing process, and you’ll need to continue to monitor performance to keep everything running smoothly.
Reading: Indexes, partitions, and other ways to optimize
Reading
Optimization for data reading
One of the continual tasks of a database is reading data. Reading is the process of interpreting and processing data to make it available and useful to users. As you have been learning, database optimization is key to maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance. Optimizing reading is one of the primary ways you can improve database performance for users. Next, you will learn more about different ways you can optimize your database to read data, including indexing and partitioning, queries, and caching.
Indexes
Sometimes, when you are reading a book with a lot of information, it will include an index at the back where that information is organized by topic with page numbers listed for each reference. This saves you time if you know what you want to find: instead of flipping through the entire book, you can go straight to the index, which will direct you to the information you need.

Indexes in databases are basically the same: they use the keys from the database tables to very quickly search through specific locations in the database instead of the entire thing. This is why they’re so important for database optimization: when users run a search in a fully indexed database, it can return the information much faster. For example, a table with columns ID, Name, and Department could use an index with the corresponding names and IDs. The database can then quickly locate the names in the larger table for searches using those IDs from the index.
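Here is a minimal sketch of that example, assuming a relational database that supports CREATE INDEX (the employees table is hypothetical):

```sql
-- Create an index on the ID column so lookups no longer scan the whole table.
CREATE INDEX idx_employees_id ON employees (ID);

-- This search can now jump straight to the matching row via the index:
SELECT Name, Department
FROM employees
WHERE ID = 1042;
```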
Partitions
Data partitioning is another way to speed up database retrieval. There are two types of partitioning: vertical and horizontal. Horizontal partitioning is the most common, and involves designing the database so that rows are organized by logical groupings instead of stored in columns. The different rows are stored in different tables, which reduces the index size and makes it easier to write and retrieve data from the database.

Instead of creating an index table to help the database search through the data faster, partitions split larger, unwieldy tables into much more manageable, smaller tables. For example, a large sales table can be broken down into smaller tables, which are easier to query because the database doesn’t need to search through as much data at one time.
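As a hedged BigQuery-style sketch (the dataset, table, and column names are hypothetical; you will build real partitioned tables in the activity later in this module):

```sql
-- Horizontal partitioning: each year of sales lands in its own partition, so
-- a query that filters on sale_year scans only the matching partition.
CREATE TABLE `mydataset.sales_partitioned`
PARTITION BY
  RANGE_BUCKET(sale_year, GENERATE_ARRAY(2015, 2023, 1))
AS (
  SELECT * FROM `mydataset.sales_raw`
);
```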
Other optimization methods
In addition to making your database easier to search through with indexes and partitions, you can also optimize your actual searches for readability or use your system’s cached memory to save time retrieving frequently used data.
Queries
Queries are requests for data or information from a database. In many cases, you might have a collection of queries that you run regularly; these might be automated queries that generate reports, or regular searches made by users.
If these queries are not optimized, they can take a long time to return results to users and take up database resources in general. There are a few things you can do to optimize queries:
- Consider the business requirements: Understanding the business requirements can help you determine what information you really need to pull from the database and avoid putting unnecessary strain on the system by asking for data you don’t actually need.
- Avoid using SELECT * and SELECT DISTINCT: Using SELECT * and SELECT DISTINCT causes the database to parse through a lot of unnecessary data. Instead, you can optimize queries by selecting specific fields whenever possible.
- Use INNER JOIN instead of subqueries: Using subqueries causes the database to parse through a large number of results and then filter them, which can take more time than simply JOINing the tables in the first place. (Both of these tips are sketched below.)
Additionally, you can use pre-aggregated queries to increase database read functionality. Basically, pre-aggregating data means assembling the data needed to measure certain metrics in tables so that the data doesn’t need to be re-captured every time you run a query on it.
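Here are hedged sketches of those last two ideas, with all table and column names hypothetical: selecting only the fields you need through an INNER JOIN rather than a subquery, and pre-aggregating a frequently used metric once.

```sql
-- Select specific fields and JOIN directly instead of filtering the results
-- of a subquery.
SELECT
  o.order_id,
  o.order_date,
  c.customer_name
FROM `analytics.orders` AS o
INNER JOIN `analytics.customers` AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= '2023-01-01';

-- Pre-aggregate once so reports read a small summary table instead of
-- recomputing from raw rows on every run (BigQuery-style DATE_TRUNC shown).
CREATE TABLE `analytics.monthly_revenue` AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS order_month,
    SUM(amount) AS total_revenue
  FROM `analytics.orders`
  GROUP BY order_month
);
```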
If you’re interested in learning more about optimizing queries, you can check out Devart’s article on SQL Query Optimization.
Caching
Finally, the cache can be a useful way to optimize your database for readability. Essentially, the cache is a layer of short-term memory where tables and queries can be stored. By querying the cache instead of the database system itself, you can actually save on resources. You can just take what you need from the memory.
For example, if you often access the database for annual sales reports, you can save those reports in the cache and pull them directly from memory instead of asking the database to generate them over and over again.
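Many warehouses cache query results automatically, but as a hedged sketch of deliberately cache-like behavior (BigQuery syntax; the names are hypothetical, and materialized views carry their own limitations), a materialized view keeps a frequently requested report precomputed:

```sql
-- The annual sales summary is stored and refreshed by the warehouse, so
-- readers pull precomputed results instead of re-running the aggregation.
CREATE MATERIALIZED VIEW `mydataset.annual_sales_mv` AS
SELECT
  EXTRACT(YEAR FROM sale_date) AS sale_year,
  SUM(amount) AS total_sales
FROM `mydataset.sales`
GROUP BY sale_year;
```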
Key takeaways
This course has focused a lot on database optimization and how you, as a BI professional, can ensure that the systems and solutions you build for your team continue to function as efficiently as possible. Using these methods can be a key way for you to promote database speed and availability as team members access the database system. And coming up, you’re going to have opportunities to work with these concepts yourself!
Practice Quiz: Activity: Partition data and create indexes in BigQuery
Code
Part 1: Set up in BigQuery
Step 1: Access the data
To use the data for this course item, download the dataset from Kaggle – Avocado Sales 2015-2021 (US centric).
Step 2: Open the BigQuery console
Navigate to your BigQuery console.
Step 3: Create a dataset
Create a new dataset and name it mydataset. The remaining steps assume this name.
Step 4: Load the avocado data into a table
Next, select Create table from the dropdown menu and choose Upload. Select the CSV file you downloaded earlier in this activity.
Then, name the table “avocado_base.” Make sure the Dataset field reads “mydataset” and the Table type field reads “Native table.”
In the Schema section of the interface, check the box for Auto detect.
Then select Create table.
Part 2: Create tables with partitions and clusters
Step 1: Create a table without a partition or cluster
To begin, create a new table without a partition or cluster. This will serve as a baseline to compare to the partitioned and clustered tables. Name it avocados.
Then, in the Editor tab, copy and paste the following SQL code and click Run.

```sql
-- Baseline: a full copy of the source table with no partition or cluster.
CREATE TABLE
  `mydataset.avocados`
AS (
  SELECT
    *
  FROM `mydataset.avocado_base`
);
```
When you finish running the code, switch to the Results tab. Click Go to table and take note of the Details pane. Save the details for later by taking a screenshot or copying and pasting the information into another document. The dates on your screen might differ, but the table size, long-term storage size, and number of rows should be the same as in the following image.
Step 2: Create a table with a partition
Next, create a table partitioned by an integer range (the years 2015 through 2022). Name it avocados_partitioned.
Return to the tab you entered the SQL code into. Delete that code, then copy and paste the following SQL code. Click Run.

```sql
-- Partitioned: rows are bucketed into one partition per year (2015-2022).
CREATE TABLE
  `mydataset.avocados_partitioned`
PARTITION BY
  RANGE_BUCKET(Year, GENERATE_ARRAY(2015, 2022, 1))
AS (
  SELECT
    *
  FROM `mydataset.avocado_base`
);
```
When you finish running the code, switch to the Results tab. Click Go to table and take note of the Details pane. Save the details for later by taking a screenshot or copying and pasting the information into another document. After this activity, you’ll compare this to the exemplar.
Step 3: Create a table with a partition and a cluster
Next, create a table partitioned by an integer range and clustered by type. Name it avocados_clustered.
Return to the tab where you entered the SQL code. Delete that code, then copy and paste the following SQL code. Click Run.

```sql
-- Partitioned by year and clustered by type: rows within each partition are
-- sorted into blocks by the type column.
CREATE TABLE
  `mydataset.avocados_clustered`
PARTITION BY
  RANGE_BUCKET(Year, GENERATE_ARRAY(2015, 2022, 1))
CLUSTER BY
  type
AS (
  SELECT
    *
  FROM `mydataset.avocado_base`
);
```
When you finish running the code, switch to the Results tab. Click Go to table and take note of the Details pane. Save the details for later by taking a screenshot or copying and pasting the information into another document. After this activity, you’ll compare this to the exemplar.
Part 3: Query the tables and compare performance
Step 1: Query the table without a partition or cluster
Delete the code in the Editor tab, then copy and paste the following code. Click Run to query avocados—the table without a partition or cluster.

```sql
-- Baseline query: every record in the table must be scanned.
SELECT
  year,
  COUNT(*) AS number_avocados,
  SUM(TotalVolume) AS sum_TotalVolume,
  SUM(AveragePrice) AS sum_AveragePrice
FROM `mydataset.avocados`
WHERE type = 'organic'
GROUP BY year
ORDER BY year ASC;
```
When the query has finished running, check the Execution details tab. This explains that the number of records read is the total number of records in the table. In this query, the database processes all records from the table. This is reflected in S00:Input.
Note: The Working timing section on your screen might vary in color or duration. Your SQL query might take longer or shorter to run depending on differing BigQuery engine server speeds. Your screen might not match the following screenshot, but the records read and records written should match with the Rows section.
In the next steps, take note of the S00 and S01 rows as described in the preceding screenshot. You will need to compare these details to the exemplar.
Step 2: Query the partitioned table
Delete the code in the Editor tab, then copy and paste the following code. Click Run to query avocados_partitioned, the table partitioned by an integer range.

```sql
-- Same query against the partitioned table; compare its Execution details
-- with the baseline results from Step 1.
SELECT
  year,
  COUNT(*) AS number_avocados,
  SUM(TotalVolume) AS sum_TotalVolume,
  SUM(AveragePrice) AS sum_AveragePrice
FROM `mydataset.avocados_partitioned`
WHERE type = 'organic'
GROUP BY year
ORDER BY year ASC;
```
When the query has finished running, check the Execution details tab and save a screenshot of it. You’ll need to compare these details to the exemplar.
Step 3: Query the partitioned and clustered table
Delete the code in the Editor tab, then copy and paste the following code. Click Run to query avocados_clustered—the table that is partitioned by an integer range and clustered by type.

```sql
-- Because the table is clustered by type, the WHERE type = 'organic' filter
-- can skip blocks containing only other values, reducing records read.
SELECT
  year,
  COUNT(*) AS number_avocados,
  SUM(TotalVolume) AS sum_TotalVolume,
  SUM(AveragePrice) AS sum_AveragePrice
FROM `mydataset.avocados_clustered`
WHERE type = 'organic'
GROUP BY year
ORDER BY year ASC;
```
When the query has finished running, check the Execution details tab and save a screenshot of it. You will need to compare these details to the exemplar.
Reading: Activity Exemplar: Partition data and create indexes in BigQuery
Reading: Case study: Deloitte – Optimizing outdated database systems
Reading
In this part of the course, you have been learning about the importance of database optimization. This basically means maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance. Part of a BI professional’s job is examining resource use and identifying better data sources and structures. In this case study, you will have an opportunity to explore an example of how the BI team at Deloitte handled optimization when they discovered their current database was difficult for users to query.
Company background
Deloitte collaborates with independent firms to provide audit and assurance, consulting, risk and financial advisory, risk management, tax, and related services to select clients. Deloitte’s brand vision is to be a standard of excellence within the field and to uphold their brand values as they develop leading strategies and cutting edge tools for clients to facilitate their business. These values include integrity, providing outstanding value to clients, commitment to community, and strength from cultural diversity.
The challenge
Because of the size of the company and their ever-evolving data needs, the database grew and changed to match current problems without time to consider long-term performance. Because of this, the database eventually grew into a collection of unmanaged tables without clear joins or consistent formatting. This made it difficult to query the data and transform it into information that could effectively guide decision making.
The need for optimization appeared gradually as the team had to continually track data and had to repeatedly test and prove the validity of the data. With a newly optimized database, the data could more easily be understood, trusted, and used to make effective business decisions.
Primarily, this database contained marketing and financial data that would ideally be used to connect marketing campaigns and sales leads to evaluate which campaigns were successful. But because of the current state of the database, there was no clear way to tie successes back to specific marketing campaigns and evaluate their financial performance. The biggest challenge to this initiative was programming the external data sources to feed data directly into the new database, rather than into the previous tables that were scheduled to be deprecated. Additionally, the database design needed to account for tables that represented the lifecycle of the data and designed with joins that could easily and logically support different data inquiries and requests.
The approach
Because of the scale of the project and the specific needs of the organization, the BI team decided to design their own database system that they could implement across the entire organization. That way, the architecture of the database would really capture their data needs and connect tables thoughtfully so they were easier to query and use.
For example, the team wanted to be able to easily connect the initial estimate of a marketing campaign’s financial success with its ending value and how well internal processes were able to predict the success of a campaign. Increases from the initial estimate were good, but if estimates were frequently much higher than actual outcomes, it could indicate an issue with the tooling used to develop those estimates. But in the database’s current state, there were dozens of tables across accounting groups that were creating access issues that were preventing these insights from being made. Also, the different accounting groups had a lot of overlap that the team hoped to more thoughtfully structure for more long-term use.
To achieve these goals, the team strategized the architecture, developed checkpoints to determine if the required functionality could be developed and eventually scaled up, and created an iterative system wherein regular updates to the database system could be made to continue refining it moving forward.
In order to consider the database optimization project a success, the BI team wanted to address the following issues:
- Were the necessary tables and columns consolidated in a more useful way?
- Did the new schema and keys address the needs of analyst teams?
- Which tables were being queried repeatedly and were they accessible and logical?
- What sample queries could promote confidence in the system for users?
A variety of partners and stakeholders had to be involved in the optimization project because so many users across the organization would be affected. The database administrators and engineers working with the BI team were particularly key for this project because they led the design and creation of the database, mapped the life cycle of the data as it matured and changed over time and used that as a framework to construct a logical data-flow design.
These engineers then conducted interviews with various stakeholders to understand the business requirements for teams across the entire organization, trained a team of analysts on the new system, and deprecated the old tables that weren’t working.
The results
Deloitte’s BI team recognized that, while the database had been continually updated to address evolving business needs, it had grown harder to manage over time. In order to promote greater database performance and ensure their database could meet their needs, the BI team collaborated with database engineers and administrators to design a custom database architecture that thoughtfully addressed the business needs of the organization. For example, the new database structure helped build connections between tables tracking marketing campaigns over time and their successes, including revenue data and regional locations.
This database optimization effort had a lot of benefits. The greatest benefit was the organization’s ability to trust their data–the analyst team didn’t have to spend as much time validating the data before use because the tables were now organized and joined in more logical ways. The new architecture also promoted simpler queries. Instead of having to write hundreds of lines of complicated code to return simple answers, the new database was optimized for simpler, shorter queries that took less time to run.
This provided benefits for teams across the organization:
- The marketing team was able to get better feedback on the value created by specific campaigns.
- The sales team could access specific information about their regions and territories, giving them insights about possible weaknesses and opportunities for expansion.
- The strategy team was able to bridge the gap between the marketing and sales teams, and ultimately create actionable OKRs (Objectives and Key Results) for the future.
However, as you have been learning, database optimization is an iterative process. The BI team’s work didn’t stop once they implemented the new database design. They also designated a team to oversee data governance to ensure the quality of the data and prevent the same problems from happening again. This way, the data remains organized and also this team can continue refining the developed databases based on evolving business needs.
Conclusion
The databases where your organization stores their data are a key part of the BI processes and tools you create–if the database isn’t performing well, it will affect your entire organization and make it more difficult to provide stakeholders with the data they need to make intelligent business decisions. Optimizing your database promotes high performance and a better user experience for everyone on your team.
Video: The five factors in action
- Workload: This is the combination of transactions, queries, data warehousing analysis, and system commands being processed by the database system at any given time. In this case, most of the workload is processing user requests, such as generating scheduled reports or fulfilling queries. If the database can’t handle the workload, it might cause the system to crash, disrupting users’ ability to access and use the data.
- Throughput: This is the overall capability of the database’s hardware and software to process requests. Because the movie theater system is mostly focused on analysis of data from OLTP databases, they are working with an OLAP database that primarily uses cloud storage. The database storage processes and the computers within the system, which are accessing the cloud data, need to be capable of handling the theater’s workload, especially when the database system is being used a lot.
- Resources: The hardware and software that compose the system’s throughput are the resources. For example, the movie theaters might use a cache controller disk to help the database manage the storage and retrieval of data from the memory systems.
- Optimization: Ideally, users should be able to access transaction data that has been ingested from multiple other database systems. If retrieval slows down, it can take longer to get the data and provide insights to stakeholders. This is why keeping the database optimized even after it has been set up is important.
- Contention: The movie theater company has a team with many different analysts accessing and using this data. That’s in addition to the automated transformations being applied to the data and the reports being generated. All these requests can end up competing with each other and cause contention. This can be problematic if the system processes multiple requests at the same time, essentially making the same updates over and over. To limit this, the database processes queries in the order the requests are made.
It is important to consider all five of these factors when designing and managing a database system, as they can all have a significant impact on performance.
The Five Factors in Action in Business Intelligence
The five factors of business intelligence (BI) performance are workload, throughput, resources, optimization, and contention. These factors are essential considerations for any BI professional, as they can all have a significant impact on performance.
Workload
Workload is the combination of transactions, queries, data warehousing analysis, and system commands being processed by the BI system at any given time. In general, the higher the workload, the more demanding it will be on the system’s resources.
Throughput
Throughput is the overall capability of the BI system’s hardware and software to process requests. It is measured in terms of the number of queries or transactions that can be processed per second.
Resources
The resources available to the BI system include the hardware and software components that make up the system, such as the CPU, memory, storage, and network. The availability of resources can have a significant impact on throughput and performance.
Optimization
Optimization refers to the process of improving the performance of the BI system by tuning the hardware and software components, as well as the database and application code. Optimization can help to improve throughput, reduce response times, and improve overall performance.
Contention
Contention occurs when multiple users or processes are competing for the same resources. In the context of BI, contention can occur when multiple users are running complex queries, or when the system is performing resource-intensive tasks such as data loading or indexing.
How the Five Factors Work Together
The five factors of BI performance are all interrelated. For example, if the workload is high, it may be necessary to increase the resources available to the system in order to maintain throughput. Similarly, if the system is experiencing contention, it may be necessary to optimize the system or workload to improve performance.
Example
Consider a BI system that is used to generate reports on sales data. The workload for this system is likely to be highest during the peak sales season. During this time, the system may need to process a large number of queries from different users. If the system does not have enough resources to handle the workload, it may experience performance problems, such as slow response times or timeouts.
To improve performance, the BI administrator could increase the resources available to the system, such as by adding more CPU or memory. The administrator could also optimize the system by tuning the database and application code. Additionally, the administrator could work with users to reduce the workload during peak times.
Conclusion
The five factors of BI performance are essential considerations for any BI professional. By understanding these factors and how they work together, BI professionals can improve the performance of their systems and deliver better insights to their users.
Here are some additional tips for improving BI performance:
- Use a data warehouse or data lake to store and manage your data. This will help to improve performance by separating the data from the operational systems.
- Optimize your database queries. This can be done by using the appropriate indexes and by writing efficient SQL code (see the sketch after this list).
- Use a caching layer to store frequently accessed data in memory. This can help to improve performance by reducing the number of database queries that need to be executed.
- Use a load balancer to distribute the workload across multiple BI servers. This can help to improve performance and scalability.
- Monitor your BI system performance and make adjustments as needed. This can be done using a variety of tools and techniques, such as system monitoring tools and performance testing.
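As a concrete illustration of the indexing tip above, here is a minimal sketch in conventional relational SQL; the table and column names (`sales`, `customer_id`) are hypothetical. (BigQuery, used elsewhere in this course, relies on partitioning and clustering rather than user-created indexes.)

```sql
-- Hypothetical sales table used for illustration.
CREATE TABLE sales (
  sale_id     INT PRIMARY KEY,
  customer_id INT,
  sale_date   DATE,
  amount      DECIMAL(10, 2)
);

-- An index on the column used in WHERE clauses lets the database
-- locate matching rows without scanning the entire table.
CREATE INDEX idx_sales_customer ON sales (customer_id);

-- This query can now use idx_sales_customer instead of a full table scan.
SELECT sale_date, amount
FROM sales
WHERE customer_id = 1042;
```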
Earlier, we learned about the five factors of database performance: workload, throughput, resources, optimization, and contention. But how do they actually operate within a working system? Let’s explore a database and witness how the five factors affect its performance.
Before we get into how the five factors influence this database, let’s understand how it’s been designed. In this example, we’ll be checking out a movie theater chain’s system. There are a few things that we’ll need to consider during optimization. First, let’s think about what this database is being used for. In this case, the movie theater chain uses data related to ticket purchases, revenue, and audience preferences in order to make decisions about what movies to play and potential promotions. Second, we’ll consider where the data is coming from. In this example, it’s being pushed from multiple sources into an OLAP system where analysis takes place. The database also uses data from individual theaters’ OLTP systems in order to explore trends in ticket sales for different movie times and genres.
The OLTP systems that manage transaction data use a snowflake database model. At the center, there is a fact table capturing the most important information about the tickets, such as whether or not a specific seat has been reserved, the reservation type, the screening ID, the employee ID of whoever entered the reservation, and the seat number. In order to capture details about these facts, the model also includes several dimension tables connected to the fact table, with information on employee, movie, screening, auditorium, seat, and reservation. This database is fairly straightforward; it enables each movie theater to record data in these different tables and prevents them from accidentally booking the same seat twice. However, these individual OLTP systems aren’t designed for analysis, which is why the data needs to be pulled into the destination OLAP system. There, it can be accessed and explored by users in order to gain insights and make business decisions.
Okay, now that we know a little more about our database, let’s find out how the five factors of database performance influence it. First, as you know, workload is the combination of transactions, queries, data warehousing analysis, and system commands being processed by the database system at any given time. In this case, most of the workload is processing user requests, such as generating scheduled reports or fulfilling queries. If the database can’t handle the workload, it might cause the system to crash, disrupting users’ ability to access and use the data. Maybe a report requires a lot of resources to generate, or there might be a growing number of analysts accessing this data. But we know that it’s often possible to predict peak workload times, so we can make adjustments to ensure the system can handle these requests.
Now, let’s explore throughput. Again, this is the overall capability of the database’s hardware and software to process requests. Because our movie theater system is mostly focused on analysis of data from OLTP databases, we’re working with an OLAP database that primarily uses cloud storage. The database storage processes and the computers within the system that are accessing the cloud data need to be capable of handling the theater chain’s workload, especially when the database system is being used a lot.
The hardware and software that compose the system’s throughput are the resources. For example, the movie theaters might use a cache controller disk to help the database manage the storage and retrieval of data from the memory systems.
Next, we have optimization, which you’ve already learned a lot about. Ideally, users should be able to access transaction data that has been ingested from multiple other database systems. If retrieval slows down, it can take longer to get the data and provide insights to stakeholders. This is why keeping the database optimized even after it has been set up is important.
The last factor of database performance is contention. The movie theater company has a team with many different analysts accessing and using this data. That’s in addition to the automated transformations being applied to the data and the reports being generated. All these requests can end up competing with each other and cause contention. This can potentially be problematic if the system processes multiple requests at the same time, essentially making the same updates over and over. To limit this, the database processes queries in the order the requests are made.
And now you’ve gotten a chance to explore how the five factors of database performance might affect a real database system. No matter how simple or complex, these are essential considerations for any BI professional.
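To make the schema described above easier to picture, here is a hedged SQL sketch of the fact table and two of its dimension tables. All names and types are illustrative, not the course’s actual schema, and the remaining dimension tables (movie, auditorium, seat, reservation) would follow the same pattern.

```sql
-- Illustrative snowflake-style schema for a movie theater OLTP system.
CREATE TABLE dim_employee (
  employee_id INT PRIMARY KEY,
  full_name   VARCHAR(100)
);

CREATE TABLE dim_screening (
  screening_id   INT PRIMARY KEY,
  movie_id       INT,
  auditorium_id  INT,
  screening_time TIMESTAMP
);

-- The fact table records each ticket and links out to the dimensions.
CREATE TABLE fact_ticket (
  ticket_id        INT PRIMARY KEY,
  is_seat_reserved BOOLEAN,
  reservation_type VARCHAR(20),
  screening_id     INT REFERENCES dim_screening (screening_id),
  employee_id      INT REFERENCES dim_employee (employee_id),
  seat_number      VARCHAR(10)
);
```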
Reading: Determine the most efficient query
Upgraded Plugin: Design: Optimize for database speed
Practice Quiz: Test your knowledge: Database performance
Fill in the blank: A data mart is a _ database that can be a subset of a larger data warehouse. This means it is a convenient way to access the data pertaining to specific areas or departments of a business.
Subject-oriented
A data mart is a subject-oriented database that can be a subset of a larger data warehouse. This means it is a convenient way to access the data pertaining to specific areas or departments of a business.
AI assistant
The answer is: Subject-oriented.
A data mart is a subset of a larger data warehouse that is designed to support specific business functions or departments. This means that it is focused on a specific subject area, such as sales, marketing, or customer service. Data marts are typically smaller and more specialized than data warehouses, which makes them easier to manage and use.
Here are some of the benefits of using a data mart:
- Improved performance: Data marts are typically smaller and less complex than data warehouses, so they can provide faster query performance.
- Increased accessibility: Data marts can be made available to a wider range of users than data warehouses, which can help to improve business intelligence and decision-making.
- Reduced costs: Data marts can be less expensive to implement and maintain than data warehouses.
Data marts are a valuable tool for businesses of all sizes, but they can be especially beneficial for small and medium-sized businesses that may not have the resources to support a full-fledged data warehouse.
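As a rough illustration of how subject-oriented a data mart is, one common approach is to carve it out of the warehouse with a CREATE TABLE ... AS SELECT statement. The schema and column names below are hypothetical, and the date function uses PostgreSQL-style syntax.

```sql
-- Hypothetical: build a sales-focused data mart from a warehouse table,
-- copying only the columns and rows the sales team actually needs.
CREATE TABLE sales_mart.monthly_revenue AS
SELECT
  region,
  DATE_TRUNC('month', order_date) AS sales_month,
  SUM(order_total)                AS total_revenue
FROM warehouse.orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY region, DATE_TRUNC('month', order_date);
```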
A business intelligence team manager wants to support their team’s ability to perform at a high level. They investigate the overall capability of their company’s database hardware and software tools to enable the team to process stakeholder requests. In this situation, which of the factors of database performance do they consider?
Throughput
They consider throughput. Throughput describes the overall capability of the database’s hardware and software to process requests.
AI assistant
The factor of database performance that the business intelligence team manager is considering is throughput.
Throughput is the overall capability of the database hardware and software tools to process requests. It is measured in terms of the number of queries or transactions that can be processed per second.
In this case, the team manager is concerned about whether or not the database can handle the workload of the team, which includes processing stakeholder requests. If the database does not have enough throughput, it may experience performance problems, such as slow response times or timeouts.
The other factors of database performance are:
- Workload: The combination of transactions, queries, data warehousing analysis, and system commands being processed by the database system at any given time.
- Resources: The hardware and software components that make up the system, such as the CPU, memory, storage, and network.
- Optimization: The process of improving the performance of the database system by tuning the hardware and software components, as well as the database and application code.
- Contention: When multiple users or processes are competing for the same resources.
While all of these factors are important for database performance, throughput is the most relevant factor in the situation described in the question.
Here are some things that the team manager can do to improve throughput:
- Increase the resources available to the database, such as by adding more CPU or memory.
- Optimize the database and application code.
- Use a load balancer to distribute the workload across multiple database servers.
- Monitor the database performance and make adjustments as needed.
What term is used to describe data that is broken up into many pieces that are not stored together?
Fragmented data
Fragmentation most often occurs when the data is used frequently, when new data files are created, or when existing data files are modified or deleted.
AI assistant
The term fragmented data is used to describe data that is broken up into many pieces that are not stored together. This can happen when data is added to a database over time, or when data is moved from one database to another.
Fragmented data can have a negative impact on database performance, as the database has to search for all of the pieces of data in order to return a result. This can lead to slow response times and timeouts.
There are a number of ways to reduce fragmentation, such as:
- Defragmenting the database: This involves reorganizing the data so that it is stored together in contiguous blocks (see the sketch after this list).
- Using a different database design: Some database designs are less susceptible to fragmentation than others.
- Using a database management system (DBMS): A DBMS can help to reduce fragmentation by automatically reorganizing the data as needed.
If you are experiencing performance problems with your database, it is important to check for fragmentation and take steps to reduce it if necessary.
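The exact commands vary by database system. As one hedged example, PostgreSQL provides maintenance commands that rewrite tables and rebuild indexes into contiguous storage; the table name is hypothetical.

```sql
-- PostgreSQL example: reclaim dead space and rewrite the table contiguously.
-- VACUUM FULL takes an exclusive lock, so schedule it in a maintenance window.
VACUUM FULL sales;

-- Indexes fragment over time as well; rebuild them for the same table.
REINDEX TABLE sales;
```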
Review: Dynamic database design
Video: Wrap-up
You have been learning about database design and the role of BI professionals in creating and maintaining useful database systems. You have also learned about the five factors of database performance, database optimization strategies, and the importance of monitoring database performance.
As a BI professional, developing processes that enable your team to pull insights themselves is a key part of the job. However, systems and processes change over time, so it is important to continue to monitor database performance.
In the next lesson, you will learn more about optimizing systems and the tools you will create as a BI professional. You will also learn about optimizing ETL processes.
You’ve been learning about database design and the role BI professionals play in creating and maintaining useful database systems. So far, you’ve focused on the five factors of database performance: workload, throughput, resources, optimization, and contention. You also learned some strategies specifically for database optimization, and what issues to check for if your team members start noticing a slowdown. You even explored how the five factors can affect actual databases, database optimization, and the importance of keeping databases up to speed.
As a BI professional, developing processes that enable your team to pull insights themselves is a key part of the job. But systems and processes change over time; they stop working or need to be updated. That’s one of the reasons why continuing to monitor database performance is so important. The database system should have lasting high performance levels.
Coming up, you’re going to discover more about optimizing systems and the tools you’ll create as a BI professional. But first, you have another weekly challenge. As always, feel free to check back over any of the material, and review the glossary to prepare yourself for success. Once you’ve completed your assessment, I’ll meet you back here for more about optimizing ETL processes. Great job.
Reading: Glossary terms from module 2
Reading
Contention: When two or more components attempt to use a single resource in a conflicting way
Data partitioning: The process of dividing a database into distinct, logical parts in order to improve query processing and increase manageability
Database performance: A measure of the workload that can be processed by a database, as well as associated costs
ELT (extract, load, and transform): A type of data pipeline that enables data to be gathered from data lakes, loaded into a unified destination system, and transformed into a useful format
Fragmented data: Data that is broken up into many pieces that are not stored together, often as a result of using the data frequently or creating, deleting, or modifying files
Index: An organizational tag used to quickly locate data within a database system
Optimization: Maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance
Query plan: A description of the steps a database system takes in order to execute a query
Resources: The hardware and software tools available for use in a database system
Subject-oriented: Associated with specific areas or departments of a business
Throughput: The overall capability of the database’s hardware and software to process requests
Workload: The combination of transactions, queries, data warehousing analysis, and system commands being processed by the database system at any given time
Quiz: Module 2 challenge
Which of the following statements accurately describe data marts and data lakes? Select all that apply.
- Data lakes are designed to enable information accessibility because their data doesn’t require a lot of processing.
- Data marts are subject-oriented, which means they are associated with specific areas or departments of a business.
AI assistant
Data marts are subject-oriented, which means they are associated with specific areas or departments of a business. This is because data marts are typically created for a specific purpose, such as to support the sales department or the marketing department.
Data lakes are designed to enable information accessibility because their data doesn’t require a lot of processing. This is because data lakes are typically used to store raw data, which has not been processed or cleaned. This makes it easy to access the data, but it also means that the data may not be immediately usable.
Therefore, the following statements accurately describe data marts and data lakes:
- Data marts are subject-oriented.
- Data lakes are designed to enable information accessibility.
The other two statements are incorrect. Data lakes are not subject-oriented, and data marts are not designed to enable information accessibility because their data does require a lot of processing.
What are some key benefits of ELT data pipelines in business intelligence?
- ELT reduces storage costs and enables businesses to scale storage and computation resources independently.
- ELT can ingest many different kinds of data into a storage system as soon as that data is available.
- ELT enables business intelligence professionals to transform only the data they need.
AI assistant
Key benefits of ELT data pipelines in business intelligence:
- ELT reduces storage costs and enables businesses to scale storage and computation resources independently. With ELT, data is loaded into a data lake in its raw form, without any processing. This means that businesses can store large amounts of data without having to invest in expensive processing power. Additionally, businesses can scale their storage and computation resources independently, as needed.
- ELT makes data available in the destination system quickly, because transformation happens after loading rather than in transit. This means that businesses can start analyzing their data sooner, without having to wait for it to be fully processed.
- ELT can ingest many different kinds of data into a storage system as soon as that data is available. This makes it a good choice for businesses that need to process large volumes of data from a variety of sources, such as streaming data from sensors or social media data.
- ELT enables business intelligence professionals to transform only the data they need. This can save time and resources, as businesses do not have to process all of their data upfront.
In addition to these benefits, ELT data pipelines can also help businesses to:
- Improve data quality by enabling businesses to identify and correct data errors before they are loaded into the data warehouse.
- Increase data agility by making it easier for businesses to add new data sources and change their data processing rules.
- Reduce IT costs by simplifying the data processing process and eliminating the need for dedicated ETL tools.
Overall, ELT data pipelines offer a number of benefits for businesses that are looking to improve their business intelligence capabilities.
Here are some examples of how ELT data pipelines can be used in business intelligence:
- A retail company can use an ELT data pipeline to ingest data from its point-of-sale system, e-commerce platform, and customer relationship management (CRM) system into a data lake. The company can then use this data to analyze customer behavior, identify trends, and improve its marketing and sales strategies.
- A financial services company can use an ELT data pipeline to ingest data from its trading platform, risk management system, and customer relationship management (CRM) system into a data lake. The company can then use this data to analyze market trends, identify risk factors, and develop new financial products and services.
- A healthcare provider can use an ELT data pipeline to ingest data from its electronic health record (EHR) system, patient portal, and billing system into a data lake. The provider can then use this data to improve patient care, identify areas for cost savings, and comply with regulatory requirements.
ELT data pipelines are a powerful tool that can help businesses to improve their business intelligence capabilities and achieve their business goals.
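A minimal sketch of the transform-after-load step, written in BigQuery-style SQL with hypothetical dataset and column names: the raw events have already been extracted and loaded untouched, and the transformation runs inside the warehouse, only on the data that is needed.

```sql
-- ELT sketch: raw_layer.events was loaded as-is (the E and L steps).
-- The T step runs inside the warehouse, on demand.
CREATE OR REPLACE TABLE analytics.daily_ticket_sales AS
SELECT
  DATE(event_timestamp) AS sale_date,
  theater_id,
  COUNT(*)          AS tickets_sold,
  SUM(ticket_price) AS revenue
FROM raw_layer.events
WHERE event_type = 'ticket_purchase'
GROUP BY sale_date, theater_id;
```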
What is a measure of the workload that can be processed by a database, as well as the associated costs?
Database performance
AI assistant
Database performance is a measure of the workload that can be processed by a database, as well as the associated costs. It is typically measured in terms of throughput, response time, and resource utilization.
Throughput is the number of transactions that can be processed by the database per unit time. Response time is the average time it takes for the database to respond to a query. Resource utilization is the percentage of time that the database’s resources, such as CPU and memory, are being used.
Database performance is important because it can affect the user experience, the cost of operating the database, and the ability of the database to support the business’s needs. For example, if a database is not performing well, users may experience long response times or even errors. This can lead to frustration and lost productivity. Additionally, a database that is not performing well may require more resources, such as CPU and memory, to operate. This can increase the cost of operating the database. Finally, if a database is not performing well, it may not be able to support the business’s needs. For example, a database that is not performing well may not be able to process the volume of data that the business needs to process or support the number of users that need to access the database.
There are a number of factors that can affect database performance, including the hardware that the database is running on, the software that the database is using, and the way that the database is configured and managed.
Here are some tips for improving database performance:
- Choose the right hardware for the database. The database should have enough CPU, memory, and storage to meet the needs of the workload.
- Use the right software for the database. The database software should be optimized for the type of workload that the database will be running.
- Configure the database properly. The database should be configured to meet the needs of the workload and to minimize resource usage.
- Monitor the database performance. The database performance should be monitored regularly to identify any potential problems.
By following these tips, you can improve database performance and ensure that the database is meeting the needs of the business.
Which of the following statements accurately describes workload with regards to database performance?
Workload is the combination of transactions, queries, analysis, and system commands being processed by the database system at any given time.
AI assistant
The following statement accurately describes workload with regards to database performance:
Workload is the combination of transactions, queries, analysis, and system commands being processed by the database system at any given time.
This means that workload includes all of the activities that the database is performing, such as processing transactions, responding to queries, and performing analytics. Workload is important because it can have a significant impact on database performance. For example, a heavy workload can cause the database to slow down or even become unresponsive.
The other statements are not accurate:
- Workload is not two or more components attempting to use a single resource in a conflicting way. This is called contention.
- Workload is not the overall capability of the database’s hardware and software to process requests. This is called throughput.
- Workload is not about maximizing the speed and efficiency with which data is retrieved. This is called database optimization.
Database performance is important because it can affect the user experience, the cost of operating the database, and the ability of the database to support the business’s needs. By understanding workload, database administrators can take steps to improve database performance and ensure that the database is meeting the needs of the business.
When evaluating a database system’s resources, what does a business intelligence professional consider? Select all that apply.
Software, Disk space and memory, Hardware
AI assistant
A business intelligence professional considers the following when evaluating a database system’s resources:
- Hardware: This includes the CPU, memory, and storage devices of the database system. The hardware must be powerful enough to support the database workload.
- Software: This includes the database software itself, as well as any other software that is used to interact with the database, such as business intelligence tools.
- Disk space and memory: The database system needs enough disk space to store the data and enough memory to process the data.
- Users: Strictly speaking, users are not themselves a resource, but the number of users who need access determines how much hardware and software capacity the system requires.
In addition to these factors, business intelligence professionals may also consider other factors when evaluating a database system’s resources, such as:
- Cost: The cost of the database system, including the hardware, software, and maintenance costs.
- Scalability: The ability of the database system to scale to meet the needs of the business as it grows.
- Security: The security features of the database system and its ability to protect the data from unauthorized access.
- Availability: The availability of the database system and its ability to be up and running when needed.
By considering all of these factors, business intelligence professionals can choose a database system that meets the needs of their business and helps them to achieve their business goals.
Here are some examples of how business intelligence professionals consider database system resources when making decisions:
- A business intelligence professional may recommend that a company purchase a new database server with more CPU and memory in order to improve the performance of the company’s database system.
- A business intelligence professional may recommend that a company implement a database caching solution in order to reduce the load on the database server and improve the performance of database queries.
- A business intelligence professional may recommend that a company implement a data archiving solution in order to reduce the amount of data that needs to be stored on the database server and improve the performance of database queries.
By carefully considering database system resources, business intelligence professionals can help businesses to improve the performance of their database systems and make better decisions about their data.
Optimization involves decreasing _, which is how long it takes for a database to respond to a user request.
Response time
AI assistant
Optimization involves decreasing response time, which is how long it takes for a database to respond to a user request.
Response time is an important measure of database performance, as it can affect the user experience and the overall productivity of a business. A long response time can lead to frustration and lost productivity, as users may have to wait for an extended period of time for the database to respond to their requests.
There are a number of factors that can affect database response time, including the workload of the database, the hardware and software that the database is using, and the way that the database is configured and managed.
Here are some tips for decreasing database response time:
- Reduce the workload on the database. This can be done by offloading some of the workload to other servers, such as a web server or application server.
- Upgrade the database hardware and software. This can help to improve the performance of the database and reduce response times.
- Configure the database properly. The database should be configured to meet the needs of the workload and to minimize resource usage.
- Monitor the database performance regularly to identify potential problems before they affect users (one monitoring sketch follows this list).
By following these tips, you can decrease database response time and improve the overall performance of your database system.
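For the monitoring tip, BigQuery (used elsewhere in this module) exposes recent job metadata through INFORMATION_SCHEMA, so a query like the following can surface the slowest requests. The region qualifier and the one-day window are illustrative.

```sql
-- Find the ten longest-running queries from the past day in a BigQuery project.
SELECT
  job_id,
  user_email,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_seconds,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY runtime_seconds DESC
LIMIT 10;
```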
A business intelligence professional is investigating the steps their database system takes in order to execute a query. They discover that creating a new table will enhance performance. What does this scenario describe?
Checking a query plan
AI assistant
The scenario you described describes checking a query plan.
A query plan is a description of the steps that a database system takes in order to execute a query. It shows the order in which the database system will access the tables and indexes in the database, and the operations that will be performed on the data; many database tools can render it visually.
Business intelligence professionals can use query plans to identify performance bottlenecks and to tune queries for better performance. For example, if a query plan shows that the database system builds a temporary table each time it executes a query, the business intelligence professional may be able to optimize the query by creating that table ahead of time.
Here are some other things that business intelligence professionals can look for when checking a query plan:
- Are the tables and indexes being accessed in the correct order?
- Is the database system using the correct indexes?
- Is the database system performing the optimal operations?
If the business intelligence professional finds anything that can be improved in the query plan, they can adjust the query and then check the query plan again to see if the improvement has been made.
By checking query plans, business intelligence professionals can help to improve the performance of database systems and make database queries more efficient.
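The command for viewing a query plan varies by system. In PostgreSQL, for example, EXPLAIN ANALYZE executes the query and reports each step with its actual timing; the tables here are hypothetical.

```sql
-- PostgreSQL example: show the steps the planner chooses for this query.
EXPLAIN ANALYZE
SELECT r.reservation_type, COUNT(*) AS reservations
FROM reservations AS r
JOIN screenings   AS s ON s.screening_id = r.screening_id
WHERE s.screening_date = DATE '2024-03-01'
GROUP BY r.reservation_type;
-- If the plan shows a sequential scan on reservations, an index on
-- reservations (screening_id) may be worth adding.
```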
Which of the following statements accurately describe indexes versus data partitions? Select all that apply.
- Indexes are organizational tags used to locate data.
- Data partitioning is the process of dividing a database into distinct, logical parts.
- Data partitioning is typically used in cloud-based systems handling big data.
AI assistant
The following statements accurately describe indexes versus data partitions:
- Indexes are organizational tags used to locate data.
- Data partitioning is the process of dividing a database into distinct, logical parts.
- Data partitioning is typically used in cloud-based systems handling big data.
- Indexes can only locate one section of a table at a time.
Indexes are data structures that improve the performance of database queries by organizing the data in a way that makes it easier to find. Indexes can be created on one or more columns in a table, and they can be used to locate specific rows or ranges of rows in the table.
Data partitioning is the process of dividing a database into distinct, logical parts. Each partition is a separate set of data that is stored and managed independently of the other partitions. Data partitioning can be used to improve the performance of database queries, to reduce the storage costs of the database, and to make it easier to manage the database.
Data partitioning is typically used in cloud-based systems handling big data. This is because data partitioning can help to improve the performance of database queries on large datasets. Additionally, data partitioning can help to reduce the storage costs of cloud-based databases, as only the partitions that are being accessed need to be loaded into memory.
Indexes can only locate one section of a table at a time. This means that if a query needs to access data from multiple sections of a table, then the index will not be able to help the database system to locate the data as quickly.
Here are some examples of how indexes and data partitions are used:
- A company may create an index on the customer_id column in its customer table. This index would make it easier to find specific customers in the table.
- A retail company may partition its sales data table by region. This would make it easier to query the data for specific regions.
- A streaming video service may partition its video data table by date. This would make it easier to query the data for videos that were uploaded on specific dates.
Indexes and data partitions are both important tools for improving the performance of database systems. By understanding how indexes and data partitions work, business intelligence professionals can use them to improve the performance of their database systems and make better decisions about their data.
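To ground the streaming-video example above, here is a hedged BigQuery-style sketch of a date-partitioned table; the dataset, table, and column names are hypothetical. (BigQuery itself relies on partitioning and clustering rather than traditional indexes.)

```sql
-- Partition video data by upload date so queries that filter on date
-- scan only the relevant partitions instead of the whole table.
CREATE TABLE media.videos (
  video_id    STRING,
  title       STRING,
  upload_date DATE,
  view_count  INT64
)
PARTITION BY upload_date;

-- This query reads only the March 2024 partitions.
SELECT title, view_count
FROM media.videos
WHERE upload_date BETWEEN DATE '2024-03-01' AND DATE '2024-03-31';
```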
Fill in the blank: Fragmented data occurs when data is broken up into many pieces that are not _____, often as a result of using the data frequently.
stored together
AI assistant
Fragmented data occurs when data is broken up into many pieces that are not stored together. This can happen for a number of reasons, such as when data is added to a database over time and the new data is stored in different locations on the database server. It can also happen when data is deleted from a database and the space that was occupied by the deleted data is not reused.
Fragmented data can lead to performance problems, as the database server has to spend more time searching for the different pieces of data that make up a record. It can also make it more difficult to manage and maintain the database.
There are a number of ways to prevent and fix data fragmentation, such as:
- Regularly defragmenting the database: This will reorganize the data on the database server so that it is stored together.
- Using a database management system (DBMS) that supports data partitioning: This allows you to divide the database into smaller, more manageable parts.
- Designing the database schema carefully: This will help to minimize the amount of data fragmentation that occurs.
By following these tips, you can help to improve the performance and manageability of your database system.
When two or more data analysts attempt to use a single data resource in a conflicting way, what is the result?
Contention
AI assistant
When two or more data analysts attempt to use a single data resource in a conflicting way, the result is contention.
Contention occurs when two or more processes or instances compete for access to the same resource at the same time. In the context of data analytics, this can happen when two or more analysts are trying to write to the same data file or database table at the same time.
Contention can lead to a number of problems, including:
- Data corruption: If two analysts are trying to write different values to the same data field at the same time, it can corrupt the data.
- Deadlocks: Deadlocks occur when two or more processes are waiting for each other to finish before they can proceed. This can happen when two analysts are trying to lock the same data resource at the same time.
- Reduced performance: Contention can reduce the performance of database systems, as the database system has to spend more time managing the conflicting requests.
There are a number of ways to prevent and reduce contention, such as:
- Using a database management system (DBMS) that supports concurrency control: This allows the DBMS to manage concurrent access to data resources and prevent conflicts from occurring (see the sketch after this list).
- Designing the database schema carefully: This can help to reduce the amount of contention that occurs.
- Educating data analysts on the importance of avoiding contention: Data analysts should be aware of the potential problems caused by contention and should take steps to avoid it.
By following these tips, you can help to prevent and reduce contention and improve the performance and reliability of your data analytics systems.
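As a small illustration of concurrency control, most relational systems serialize conflicting writers through transactions and row locks. This PostgreSQL-style sketch uses hypothetical table and column names.

```sql
-- Two analysts updating the same row are serialized by the row lock:
-- the second transaction waits until the first commits or rolls back.
BEGIN;

-- Lock the row so no other transaction can modify it concurrently.
SELECT available_seats
FROM screenings
WHERE screening_id = 501
FOR UPDATE;

UPDATE screenings
SET available_seats = available_seats - 1
WHERE screening_id = 501;

COMMIT;
```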