You’ll learn about optimization techniques including ETL quality testing, data schema validation, business rule verification, and general performance testing. You’ll also explore data integrity and learn how built-in quality checks defend against potential problems. Finally, you’ll focus on verifying business rules and general performance testing to make sure pipelines meet the intended business need.
Learning Objectives
- Discover strategies for creating an ETL process that meets organizational and stakeholder needs, and learn how to maintain an ETL process efficiently.
- Introduce the tools used in ETL.
- Understand the primary goals of ETL quality testing.
- Understand the primary goals of data schema validation.
- Develop ETL quality testing and data schema validation best practices.
- Identify and implement appropriate test scenarios and checkpoints for QA on data pipelines.
- Explain different methods for QA data in the pipeline.
- Create performance testing scenarios and measure performance throughout the pipeline.
- Verify business rules.
- Perform general performance testing.
Optimizing pipelines and ETL processes
Video: Welcome to module 3
In this section of the course, BI professionals will learn about ETL quality testing, data schema validation, verifying business rules, and general performance testing.
- ETL quality testing ensures that data is extracted, transformed, and loaded to its destination without any errors or issues.
- Data schema validation keeps source data aligned with the target database schema. A schema mismatch can cause system failures.
- Verifying business rules ensures that the pipeline is fulfilling the business need it was intended to.
- General performance testing ensures that the pipeline is performing efficiently.
These optimization processes are important for keeping the ETL running smoothly and ensuring that data is accurate and reliable.
You’ve learned a lot about how BI professionals ensure that their organizations’ database systems and tools continue to be as useful as possible. This includes evaluating whether fixes or updates are needed, and performing optimization when necessary. Previously, we focused specifically on optimizing database systems. Now it’s time to explore optimizing pipelines and ETL processes.

In this section of the course, you’ll learn about ETL quality testing, data schema validation, verifying business rules, and general performance testing. Through ETL quality testing, BI professionals aim to confirm that data is extracted, transformed, and loaded to its destination without any errors or issues. This is especially important because sometimes your pipeline might start producing bad or misleading results. This can happen when the original sources are changed without your knowledge.

Also, we’ll soon cover data schema validation, which is used to keep source data aligned with the target database schema. A schema mismatch can cause system failures. This is critical to keeping the ETL running smoothly. We’ll also investigate data integrity and how built-in quality checks defend against potential problems. Finally, we’ll focus on verifying business rules and general performance testing to make sure the pipeline is fulfilling the business need it was intended to. There’s a lot to come, so let’s begin exploring the optimization processes for pipelines and ETL.
Video: The importance of quality testing
ETL quality testing is the process of checking data for defects in order to prevent system failures. It involves seven validation elements:
- Completeness: Confirming that the data contains all the desired components or measures.
- Consistency: Confirming that data is compatible and in agreement across all systems.
- Conformity: Confirming that the data fits the required destination format.
- Accuracy: Confirming that the data conforms to the actual entity that’s being measured or described.
- Redundancy: Ensuring that only the necessary data is moved, transformed, and stored, with no data duplicated across locations.
- Integrity: Checking for any missing relationships in the data values.
- Timeliness: Confirming that data is current.
These elements are important for ensuring that data is accurate and reliable. BI professionals can use a variety of methods to test data quality, such as data mapping, data profiling, and data validation.
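For instance, a simple data profiling pass can surface many of these issues before deeper testing begins. The sketch below is only an illustration: it assumes a hypothetical sales table with sale_date, product_id, and amount columns, so adjust the names to your own schema.

```sql
-- Minimal data profiling sketch (hypothetical "sales" table and column names).
SELECT
  COUNT(*)                     AS total_rows,         -- completeness: overall volume
  COUNT(*) - COUNT(product_id) AS missing_product_id, -- completeness: null check
  COUNT(DISTINCT product_id)   AS distinct_products,  -- redundancy: unexpected duplicates?
  MIN(sale_date)               AS earliest_sale,      -- timeliness: how stale is the data?
  MAX(sale_date)               AS latest_sale,
  MIN(amount)                  AS min_amount,         -- accuracy: implausible values
  MAX(amount)                  AS max_amount
FROM sales;
```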
Tutorial on the Importance of Quality Testing in Business Intelligence
Introduction
Business intelligence (BI) is the process of collecting, analyzing, and interpreting data to help businesses make better decisions. Quality testing is the process of checking data for errors or inconsistencies before it is used in BI applications.
Why is quality testing important in BI?
Quality testing is important in BI because it helps to ensure that the data used to make decisions is accurate and reliable. If the data is inaccurate or unreliable, the decisions made using that data could be flawed.
This could lead to a number of problems, such as:
- Lost revenue
- Increased costs
- Decreased customer satisfaction
- Damage to the company’s reputation
What are the benefits of quality testing in BI?
There are a number of benefits to quality testing in BI, including:
- Improved data accuracy and reliability
- Reduced risk of errors in decision-making
- Increased confidence in BI results
- Improved compliance with regulations
- Enhanced ability to identify and mitigate risks
- Improved business performance
How to perform quality testing in BI
There are a number of different ways to perform quality testing in BI. Some common methods include:
- Data profiling: Data profiling is the process of analyzing data to identify its characteristics, such as data types, value ranges, and patterns. This information can be used to identify potential errors or inconsistencies in the data.
- Data validation: Data validation is the process of checking data to ensure that it meets certain criteria, such as being within a specific value range or conforming to a specific format.
- Data reconciliation: Data reconciliation is the process of comparing data from different sources to identify any discrepancies (see the sketch after this list).
- Data cleansing: Data cleansing is the process of correcting or removing errors and inconsistencies from data.
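As an illustration of the data reconciliation method above, the sketch below compares record counts and totals between a source table and its copy in the warehouse. The schema, table, and column names (store.orders, warehouse.orders, total) are assumptions used only for illustration.

```sql
-- Data reconciliation sketch: compare a source table with its warehouse copy.
-- Table and column names are hypothetical placeholders.
SELECT
  (SELECT COUNT(*)   FROM store.orders)     AS source_row_count,
  (SELECT COUNT(*)   FROM warehouse.orders) AS target_row_count,
  (SELECT SUM(total) FROM store.orders)     AS source_total,
  (SELECT SUM(total) FROM warehouse.orders) AS target_total;
-- Any mismatch between the source and target columns is a discrepancy to investigate.
```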
Conclusion
Quality testing is an important part of any BI process. By ensuring that the data used in BI applications is accurate and reliable, businesses can make better decisions and improve their performance.
Here are some additional tips for performing quality testing in BI:
- Identify the risks: Before you start testing, identify the areas where the data is most likely to be inaccurate or incomplete. This will help you to focus your testing efforts on the most important areas.
- Use a variety of testing methods: There is no one-size-fits-all approach to quality testing in BI. Use a variety of testing methods, such as data profiling, data validation, data reconciliation, and data cleansing, to ensure that the data is accurate and reliable.
- Automate your testing: If possible, automate your quality testing process. This will save you time and help you to ensure that your data is always tested before it is used in BI applications.
- Monitor your test results: Once you have tested your data, monitor the test results to identify any trends or patterns. This information can be used to improve your quality testing process and to identify areas where the data needs to be improved.
By following these tips, you can ensure that the data used in your BI applications is accurate and reliable, and that you are making the best possible decisions for your business.
When quality testing, why does a business intelligence professional confirm data conformity?
To ensure the data fits the required destination format
When quality testing, a business intelligence professional confirms data conformity in order to ensure the data fits the required destination format.
You’re already familiar with ETL pipelines, where data is extracted from a source, transformed while it’s being moved, and then loaded into a destination table where it can be acted on. Part of the transformation step in the ETL process is quality testing. In BI, quality testing is the process of checking data for defects in order to prevent system failures. The goal is to ensure the pipeline continues to work properly. Quality testing can be time-consuming, but it’s extremely important for an organization’s workflow. Quality testing involves seven validation elements: completeness, consistency, conformity, accuracy, redundancy, integrity, and timeliness. That’s a lot of elements to keep in mind, but we’re going to break down each in this video, starting with completeness. Also, you may recall some of these concepts from the Google Data Analytics certificate. If you’d like, take a few minutes to review that content before moving ahead.

Let’s start with checking for completeness. This involves confirming that the data contains all the desired components or measures. For example, imagine you’re working with sales data and you have an ETL pipeline that delivers monthly data to target tables. These target tables are used to generate reports for stakeholders. If the data being moved through the pipeline is missing a week of data, or information about one of the best-selling products, or another key metric, then the calculations used to create reports won’t have complete, accurate data.

Next, we have consistency. You might have learned that in a data analytics context, consistency deals with the degree to which data is repeatable from different points of entry or collection. In BI, it’s a bit different. Here, consistency involves confirming that data is compatible and in agreement across all systems. Imagine two systems: one is an HR database with employee data, and the other is a payroll system. If the HR database lists an employee who either isn’t in the payroll system or is listed differently there, that inconsistency could create problems.

Next is conformity. This element is all about whether the data fits the required destination format. Consider sales data in an ETL pipeline. If the data being extracted includes dates of sale that don’t match the dates the destination table is designed to hold, that’s going to create errors.

Now, accuracy has to do with the data conforming to the actual entity that’s being measured or described. Another way of thinking about this is whether the data represents real values. With that in mind, any mistyped entries or errors from the source are problematic because they will be reflected in the destination. Source systems requiring a lot of manual data entry are more likely to have issues with accuracy. If a purchase of a hamburger was mis-entered as selling for a million dollars, that’s something you need to take care of before the data is loaded.

If you’re using a relational storage database, ensuring that there isn’t any redundancy in the data is another important element of quality testing. In a BI context, redundancy is moving, transforming, or storing more than the necessary data. This occurs when the same piece of data is stored in two or more places. Moving data through a pipeline requires processing power, time, and resources, so it’s important not to move any more data than you need. For instance, if client company names are listed in multiple places but are only required to appear in one place in the destination table, we wouldn’t want to waste resources on loading that redundant data.

Now we come to integrity. Integrity concerns the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle. In quality testing, this often means checking for any missing relationships in the data values. As an example, say a company’s sales database is relational. BI professionals would depend on those relationships to manipulate data within the database and to query the data. Maybe they have product IDs and descriptions in a database. But if there’s a description and no corresponding record with the ID, there’s now an issue with data integrity. It’s essential to make sure this is addressed before moving on to analysis. Data mapping is one way to make sure that the data from the source matches the data in the target database. You’ll learn more about this later, but basically, data mapping is the process of matching fields from one data source to another.

Last but not least, you want to make sure your data is timely. Timeliness involves confirming that data is current. This check is done specifically to make sure data has been updated with the most recent information that can provide relevant insights. For example, if a data warehouse is supposed to contain daily data but doesn’t update properly, then the pipeline can’t ingest the latest information. BI professionals are mostly interested in exploring current data in order to allow stakeholders to gain the freshest insights. This definitely won’t be possible if the data being moved is already outdated.

A lot goes into ETL quality testing, and it can be a tricky process, but remembering these seven key elements is a wonderful first step toward creating high-quality pipelines. Coming up, you’re going to learn even more about these checks and other performance tests for ETL processes. Bye for now.
Reading: Seven elements of quality testing
Reading
In this part of the course, you have been learning about the importance of quality testing in your ETL system. This is the process of checking data for defects in order to prevent system failures. Ideally, your pipeline should have checkpoints built in that identify any defects before they arrive in the target database system. These checkpoints ensure that the data coming in is already clean and useful! In this reading, you will get a checklist of what your ETL quality testing should take into account.
Checking for data quality involves ensuring the data is trustworthy before reaching its destination. When considering what checks you need to ensure the quality of your data as it moves through the pipeline, there are seven elements you should consider:
- Completeness: Does the data contain all of the desired components or measures?
- Consistency: Is the data compatible and in agreement across all systems?
- Conformity: Does the data fit the required destination format?
- Accuracy: Does the data conform to the actual entity being measured or described?
- Redundancy: Is only the necessary data being moved, transformed, and stored for use?
- Timeliness: Is the data current?
- Integrity: Is the data accurate, complete, consistent, and trustworthy? (Integrity is influenced by the previously mentioned qualities.) Quantitative validations, including checking for duplicates, the number of records, and the amounts listed, help ensure data’s integrity.
Common issues
There are also some common issues you can protect against within your system to ensure the incoming data doesn’t cause errors or other large-scale problems in your database system:
- Check data mapping: Does the data from the source match the data in the target database?
- Check for inconsistencies: Are there inconsistencies between the source system and the target system?
- Check for inaccurate data: Is the data correct and does it reflect the actual entity being measured?
- Check for duplicate data: Does this data already exist within the target system?
To address these issues and ensure your data meets all seven elements of quality testing, you can build intermediate steps into your pipeline that check the loaded data against known parameters. For example, to ensure the timeliness of the data, you can add a checkpoint that determines if that data matches the current date; if the incoming data fails this check, there’s an issue upstream that needs to be flagged. Considering these checks in your design process will ensure your pipeline delivers quality data and needs less maintenance over time.
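As a minimal sketch of such a timeliness checkpoint, the query below returns a row only when the most recent load is older than the current date, which signals an upstream issue to flag. The events table and loaded_at column are hypothetical names, and exact date functions can vary by database.

```sql
-- Timeliness checkpoint sketch (hypothetical "events" table and "loaded_at" column).
-- If this returns a row, the latest batch is stale and the upstream issue should be flagged.
SELECT MAX(loaded_at) AS latest_load
FROM events
HAVING MAX(loaded_at) < CURRENT_DATE;
```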
Key takeaways
One of the great things about BI is that it gives us the tools to automate certain processes that help save time and resources during data analysis. Building quality checks into your ETL pipeline system is one of the ways you can do this! Making sure you are already considering the completeness, consistency, conformity, accuracy, redundancy, integrity, and timeliness of the data as it moves from one system to another means you and your team don’t have to check the data manually later on.
Upgraded Plugin: Validate: Data quality and integrity
Reading: Monitor data quality with SQL
Reading
As you’ve learned, it is important to monitor data quality. By monitoring your data, you become aware of any problems that may occur within the ETL pipeline and data warehouse design. This can help you address problems as early as possible and avoid future problems.
In this reading, you’ll follow a fictional scenario where a BI engineer performs quality testing on their pipeline and suggests SQL queries that one could use for each step of testing.
The scenario
At Francisco’s Electronics, an electronics manufacturing company, a BI engineer named Sage designed a data warehouse for analytics and reporting. After the ETL process design, Sage created a diagram of the schema.
The diagram of the schema of the sales_warehouse database contains different symbols and connectors that represent two important pieces of information: the major tables within the system and the relationships among these tables.
The sales_warehouse database schema contains five tables:
- Sales
- Products
- Users
- Locations
- Orders
These tables are connected via keys. The tables contain five to eight columns (or attributes) ranging in data type. The data types include varchar or char (or character), integer, decimal, date, text (or string), timestamp, and bit.
The foreign keys in the Sales table link to each of the other tables:
- The “product_id” foreign key links to the Products table
- The “user_id” foreign key links to the Users table
- The “order_id” foreign key links to the Orders table
- The “shipping_address_id” and “billing_address_id” foreign keys link to the Locations table
After Sage made the sales_warehouse database, the development team made changes to the sales site. As a result, the original OLTP database changed. Now, Sage needs to ensure the ETL pipeline works properly and that the warehouse data matches the original OLTP database.
Sage used the original OLTP schema from the store database to design the warehouse.
The store database schema also contains five tables—Sales, Products, Users, Locations, and Orders—which are connected via keys. The tables contain four to eight columns ranging in data type. The data types include varchar or char, integer, decimal, date, text, timestamp, bit, tinyint, and datetime.
Every table in the store database has an id field as a primary key. The database contains the following tables:
- The Sales table has price, quantity, and date columns. It references a user who made a sale (UserId), purchased a product (ProductId), and a related order (OrderId). Also, it references the Locations table for shipping and billing addresses (ShippingAddressId and BillingAddressId, respectively).
- The Users table has FirstName, LastName, Email, Password, and other user-related columns.
- The Locations table contains address information (Address1, Address2, City, State, and Postcode).
- The Products table has Name, Price, InventoryNumber, and Category of products.
- The Orders table has OrderNumber and purchase information (Subtotal, ShippingFee, Tax, Total, and Status).
Using SQL to find problems
Sage compared the sales_warehouse database to the original store database to check for completeness, consistency, conformity, accuracy, redundancy, integrity, and timeliness. Sage ran SQL queries to examine the data and identify quality problems. Then Sage prepared the following tables, which include the types of quality issues found, the quality strategies that were violated, the SQL queries used to find the issues, and specific descriptions of the issues.
Quality testing sales_warehouse
Tested quality | Quality strategy | SQL query | Sage’s observation |
---|---|---|---|
Integrity | Is the data accurate, complete, consistent, and trustworthy? | SELECT * FROM Orders | In the sales_warehouse database, the order with ID 7 has the incorrect total value. |
Completeness | Does the data contain all of the desired components or measures? | SELECT COUNT(*) FROM Locations | The Locations table of the sales_warehouse database has an extra address. In the store database there are 60 records, whereas the sales_warehouse database table has 61. |
Consistency | Is the data compatible and in agreement across all systems? | SELECT Phone FROM Users | Several users within the sales_warehouse database have phones without the “+” prefix. |
Conformity | Does the data fit the required destination format? | SELECT id, postcode FROM sales_warehouse.Locations | The location ZIP code for the record with ID 6 in the sales_warehouse database is 722434213, which is wrong. The United States postal code contains either five digits or five digits followed by a hyphen (dash) and another four digits (e.g., 12345-1234). |
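The conformity check in the last row above can be made repeatable as a query. The sketch below flags postcodes in sales_warehouse.Locations that don’t match the US format Sage describes; it uses MySQL-style REGEXP syntax, which is an assumption, since the pattern-matching operator varies by database.

```sql
-- Conformity check sketch: flag postcodes that are not "12345" or "12345-1234".
-- Uses the Locations table from the scenario; REGEXP syntax is MySQL-style.
SELECT id, postcode
FROM sales_warehouse.Locations
WHERE postcode NOT REGEXP '^[0-9]{5}(-[0-9]{4})?$';
```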
Quality testing store
Feature | Quality Strategy | SQL query | Sage’s Observation |
---|---|---|---|
Integrity | Is the data accurate, complete, consistent, and trustworthy? | DESCRIBE Users | Users.Status from the store database and Users.is_active from the sales_warehouse database seem to be related fields. However, it is not obvious how the Status column is transformed into the is_active boolean column. Is it possible that with a new status value, the ETL pipeline will fail? |
Consistency | Is the data compatible and in agreement across all systems? | DESCRIBE Products | Products.Inventory from the store database has the varchar type instead of the int(10) in the sales_warehouse database Products.inventory field. This can be a problem if there is a value with characters. |
Accuracy | Does the data conform to the actual entity being measured or described? | DESCRIBE Sales | The data type of Sales.Date in the store database is different from its data type in sales_warehouse (date vs datetime). It might not be a problem if time is not important for the sales_warehouse database fact table. |
Redundancy | Is only the necessary data being moved, transformed, and stored for use? | DESCRIBE Sales | The table Sales from the sales_warehouse database has a unique index constraint on OrderId, ProductId, UserId columns. It can be added to the warehouse schema. |
Key takeaways
Testing data quality is an essential skill of a BI professional that ensures good analytics and reporting. Just as Sage does in this example, you can use SQL commands to examine BI databases and find potential problems. The sooner you know the problems in your system, the sooner you can fix them and improve your data quality.
Video: Mana: Quality data is useful data
Mana is a Senior Technical Data Program Manager at Google. She helps create tools that enable business partners to make better decisions with data. She believes that good data is essential for making good decisions.
Quality testing is the process of making sure that data is accurate, relevant, representative, and timely. It is important to test data quality at every stage of the data pipeline, from extraction to transformation to loading.
Mana wishes she had been more confident in her skills when she was starting out. She believes that even if you don’t have all the traditional skills for a job, you can still be successful if you are curious, open, and humble.
She encourages people to stay curious and to learn from others. She believes that it is important to constantly evolve and become better than you were yesterday.
My name is Mana and I am a Senior Technical Data Program Manager at Google. What that means is that I work with business partners and I help create tools that enable them to make better decisions with data. I’m a big data nerd, so I love playing around with data and getting to build cool stuff that makes people’s jobs a lot easier.

There’s a common saying that in the absence of data you have dirt. I like to take that a step further. I like to say in the absence of good data, you have dirt. Quality testing is all about how you make sure that you have good data. Good data can mean a number of different things. Oftentimes, it means accurate data: how do you ensure that the numbers you are producing are correct and representative of the truth? It can also mean relevant data. It can also mean representative data. It can also mean quick data at your fingertips. The process of quality control is making sure that the tools you’re building with respect to data are accurate, helpful, relevant, and timely.

There are many times across the life cycle of building a BI product in which quality testing comes into play. If you think of the very early stages where maybe someone is trying to extract data from logs, you’re going to want to make sure that the data you’re extracting is accurate. It’s the same data that’s coming in through the logs, the same data that’s being spit out from the data mart maybe that I’m creating. It’s the same idea through your ETL processes, where ETL means extract, transform, and load. As you’re grabbing the data, you’re transforming it, you’re massaging it, and you’re creating relevancy with that data. You want to make sure that the data you got in is the same data you’re spitting out.

I remember when I was a young professional, I was always imagining that one day I would land at a company that was data nirvana and all of their data would be so clean and so perfect and it was just magical; I could just query it with no worries in the world. The truth is that nirvana does not exist anywhere. Data always has issues. There are always bugs that come up. Even the data that you were looking at today might be different than the data that you look at tomorrow. Embedding quality testing, not only as a one-off, but as a regular process in your pipelines or whatever you’re building, is of incredible importance because bugs are just bound to happen.

There are a couple of things that I wish I knew when I was starting out. I wish that I had fully been able to recognize how many skills I had in this department that came from other parts of my life and weren’t the traditional means of BI skills, and with that, having more confidence in myself, because I came in with really fantastic storytelling skills that I was really able to hone in on. Even though I wasn’t a software engineer, I have a lot of experience coding and I knew a lot of best practices that I was able to put in place. I would say that if there are areas that you feel like you’re maybe not the strongest in, if you stay curious and you stay open and you stay humble, and you find folks that are really great at those things and you ask them, “Hey, how do you do that?”, know that you can learn and you can grow in those capacities, and you can become continuously well-rounded. It’s very normal for us as humans to have our strengths and our growth areas. But being successful is not having them innately. It’s having the ability to constantly evolve, better yourself, and not become the expert, but just become better than you were yesterday.
Practice Quiz: Test your knowledge: Optimize pipelines and ETL processes
What is the business intelligence process that involves checking data for defects in order to prevent system failures?
Quality testing
Quality testing is the business intelligence process that involves checking data for defects in order to prevent system failures.
Fill in the blank: Completeness is a quality testing step that involves confirming that the data contains all desired __ or components.
Measures
Completeness is a quality testing step that involves confirming that the data contains all desired measures or components.
AI assistant
The correct answer is: Measures
Data completeness is the extent to which all required data fields are present in a dataset. This means that there are no missing values or null entries for any of the fields. For example, if a dataset contains a field for customer name, all of the records in the dataset should have a value for that field.
Completeness is an important aspect of data quality because it ensures that the data is accurate and reliable. If data is incomplete, it can lead to incorrect or misleading results. For example, if a dataset is missing customer names, it would be difficult to identify and track customers.
There are a number of ways to test for data completeness. One way is to manually review the data for missing values. Another way is to use data quality tools to identify missing values. Data quality tools can also be used to identify other data quality issues, such as inaccurate or duplicate data.
By ensuring that data is complete, businesses can improve the accuracy and reliability of their data-driven decisions.
A business intelligence professional is considering the integrity of their data throughout its life cycle. Which of the following goals do they aim to achieve? Select all that apply.
Data is accurate and complete, Data is consistent, Data is trustworthy
Their goal is to achieve data accuracy, completeness, consistency, and trustworthiness. These are the key considerations of data integrity.
AI assistant
The correct answers are:
- Data is accurate and complete
- Data is consistent
- Data is trustworthy
A business intelligence professional who is considering the integrity of their data throughout its life cycle aims to achieve the following goals:
- Accuracy and completeness: Data should be free from errors and omissions. All required data fields should be present, and the values in those fields should be correct.
- Consistency: Data should be consistent across all sources and systems. For example, the same customer should have the same name and address in all databases.
- Trustworthiness: Data should be reliable and believable. Users should be confident that the data is accurate and that it can be used to make sound decisions.
In addition to these goals, a business intelligence professional may also aim to achieve the following:
- Security: Data should be protected from unauthorized access, use, disclosure, disruption, modification, or destruction.
- Timeliness: Data should be available when needed.
- Accessibility: Data should be easy to find and use.
By achieving these goals, a business intelligence professional can help ensure that their data is of high quality and can be used to support effective decision-making.
Data schema validation
Video: Conformity from source to destination
Schema validation is a process of ensuring that the source system data schema matches the target database data schema. Schema validation properties should ensure three things: the keys are still valid after transformation, the table relationships have been preserved, and the conventions are consistent across the database.
Data dictionaries and data lineages are documentation tools that support data schema validation. A data dictionary is a collection of information that describes the content, format, and structure of data objects within a database, as well as their relationships. Data lineage describes the origin of data, where it has moved throughout the system, and how it has transformed over time.
Using schema validation, data dictionaries, and data lineages helps BI professionals promote consistency as data is moved from the source to destination.
Conformity from source to destination in Business Intelligence
In Business Intelligence (BI), conformity refers to ensuring that data is consistent as it is moved from source to destination. This is important for a number of reasons, including:
- Ensuring that data is accurate and reliable
- Enabling users to easily understand and compare data from different sources
- Facilitating the creation of accurate and consistent reports and dashboards
There are a number of tools and techniques that can be used to achieve conformity from source to destination in BI. These include:
- Schema validation: Schema validation is the process of ensuring that the structure of the data in the source system matches the structure of the data in the destination system. This can be done using a variety of tools, such as data profiling tools and data modeling tools.
- Data mapping: Data mapping is the process of defining how data elements in the source system are related to data elements in the destination system. This can be done manually or using data mapping tools.
- Data cleansing: Data cleansing is the process of identifying and correcting errors in data. This can be done manually or using data cleansing tools.
- Data transformation: Data transformation is the process of converting data from one format to another. This can be done using a variety of tools, such as ETL (extract, transform, load) tools and data wrangling tools.
- Data quality checks: Data quality checks are used to identify and measure the quality of data. This can be done using a variety of tools, such as data profiling tools and data quality monitoring tools.
By using these tools and techniques, BI professionals can ensure that data is consistent as it is moved from source to destination. This helps to ensure that data is accurate, reliable, and easy to use.
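As a small illustration of the schema validation idea, the query below compares column data types between a source table and its destination using the information_schema views that most relational databases expose. The schema and table names (store, warehouse, orders) are placeholders, and the exact information_schema layout varies slightly by database, so treat this as a sketch rather than a ready-made check.

```sql
-- Schema validation sketch: find columns whose data types differ between source and target.
-- Schema and table names are hypothetical; information_schema details vary by database.
SELECT
  src.column_name,
  src.data_type AS source_type,
  tgt.data_type AS target_type
FROM information_schema.columns AS src
JOIN information_schema.columns AS tgt
  ON src.column_name = tgt.column_name
WHERE src.table_schema = 'store'     AND src.table_name = 'orders'
  AND tgt.table_schema = 'warehouse' AND tgt.table_name = 'orders'
  AND src.data_type <> tgt.data_type;
```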
Here are some additional tips for achieving conformity from source to destination in BI:
- Standardize data formats: As much as possible, use standard data formats for both source and destination systems. This will make it easier to map data between systems and reduce the need for data transformation.
- Use common naming conventions: Use common naming conventions for data elements in both source and destination systems. This will make it easier for users to understand and compare data from different sources.
- Document data transformations: Document any data transformations that are performed. This will help to ensure that data is transformed consistently and that users understand how data has been transformed.
- Test data transformations: Test data transformations to ensure that they are working correctly. This will help to prevent errors in transformed data.
- Monitor data quality: Monitor data quality to identify and correct errors in data. This will help to ensure that data is accurate and reliable.
By following these tips, BI professionals can help to ensure that data is consistent as it is moved from source to destination. This helps to ensure that data is accurate, reliable, and easy to use.
Fill in the blank: A _____ is a collection of information that describes the content, format, and structure of data objects within a database, as well as their relationships.
data dictionary
A data dictionary is a collection of information that describes the content, format, and structure of data objects within a database, as well as the relationships.
You’re learning a lot about the importance of quality testing in ETL, and you now know that a key part of the process is checking for conformity, or whether the data fits the required destination format. To ensure conformity from source to destination, BI professionals have three very effective tools: schema validation, data dictionaries, and data lineages. In this video, we’ll examine how they can help you establish consistent data governance.

First, schema validation is a process to ensure that the source system data schema matches the target database data schema. As you’re learning, if the schemas don’t align, this can cause system failures that are very difficult to fix. Building schema validation into your workflow is important to prevent these issues. Database tools offer various schema validation options that can be used to check incoming data against the destination schema requirements. For example, you could dictate that a certain column contains only numerical data. Then, if you try to enter something in that column that doesn’t conform, the system will flag the error. Or, in a relational database, you could specify that an ID number must be a unique field. That means the same ID can’t be added if it matches an existing entry. This prevents redundancies in the data. With these properties in action, if the data doesn’t conform and throws an error, you’ll be alerted; or if it meets the requirements, you’ll know it’s valid and safe to load.

Schema validation properties should ensure three things: the keys are still valid after transformation, the table relationships have been preserved, and the conventions are consistent across the database. Let’s start with the keys. As you’ve been learning, relational databases use primary and foreign keys to build relationships among tables. These keys should continue to function after you’ve moved data from one system into another. For example, if your source system uses customer_id as a key, then that needs to be valid in the target schema as well.

This is related to the next property of schema validation: making sure the table relationships have been preserved. When taking in data from a source system, it’s important that these keys remain valid in the target system so the relationships can still be used to connect tables, or that they are transformed to match the target schema. For example, if the customer_id key doesn’t apply to our target system, then all of the tables that used it as a primary or foreign key are disconnected. If the relationships between tables are broken while data is being moved, the data becomes hard to access and use, and that’s the whole reason we moved it to our target system.

Finally, you want to ensure that the conventions are consistent with the target database’s schema. Sometimes data from outside sources uses different conventions for naming columns and tables. For example, you could have a source system that uses employee ID as one word to identify that field, but the target database might use employee_id. You’ll need to ensure these are consistent so you don’t get errors when trying to pull data for analysis.

In addition to the properties themselves, there are some other documentation tools that support data schema validation: data dictionaries and data lineages. A data dictionary is a collection of information that describes the content, format, and structure of data objects within a database, as well as their relationships. You might also hear this referred to as a metadata repository. You may know that metadata is data about data. This is a very important concept in BI, so if you’d like to review some of the lessons about metadata from the Google Data Analytics certificate, go ahead and do that now. In the case of data dictionaries, these represent metadata because they’re basically using one type of data, metadata, to define the use and origin of another piece of data. There are several reasons you might want to create a data dictionary for your team. For one thing, it helps avoid inconsistencies throughout a project. In addition, it enables you to define any conventions that other team members need to know in order to create more alignment across teams. Best of all, it makes the data easier to work with.

Now, let’s explore data lineages. Data lineage describes the process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time. This is useful because if you do get an error, you can track the lineage of that piece of data and understand what happened along the way to cause the problem. Then, you can put standards in place to avoid the same issue in the future.

Using schema validation, data dictionaries, and data lineages really helps BI professionals promote consistency as data is moved from the source to destination. This means all users can be confident in the BI solutions being created. We’ll keep exploring these concepts soon.
Reading: Sample data dictionary and data lineage
Reading
As you have been learning in this course, business intelligence professionals have three primary tools to help them ensure conformity from source to destination: schema validation, data dictionaries, and data lineages. In this reading, you’re going to explore some examples of data dictionaries and lineages to get a better understanding of how these items work.
Data dictionaries
A data dictionary is a collection of information that describes the content, format, and structure of data objects within a database, as well as their relationships. This can also be referred to as a metadata repository because data dictionaries use metadata to define the use and origin of other pieces of data. Here’s an example of a product table that exists within a sales database:
Product Table
Item_ID | Price | Department | Number_of_Sales | Number_in_Stock | Seasonal |
---|---|---|---|---|---|
47257 | $33.00 | Gardening | 744 | 598 | Yes |
39496 | $82.00 | Home Decor | 383 | 729 | Yes |
73302 | $56.00 | Furniture | 874 | 193 | No |
16507 | $100.00 | Home Office | 310 | 559 | Yes |
1232 | $125.00 | Party Supplies | 351 | 517 | No |
3412 | $45.00 | Gardening | 901 | 942 | No |
54228 | $60.00 | Party Supplies | 139 | 520 | No |
66415 | $38.00 | Home Decor | 615 | 433 | Yes |
78736 | $12.00 | Grocery | 739 | 648 | No |
34369 | $28.00 | Gardening | 555 | 389 | Yes |
This table is actually the final target table for data gathered from multiple sources. It’s important to ensure consistency from the sources to the destination because this data is coming from different places within the system. This is where the data dictionary comes in:
Data dictionary
Name | Definition | Data Type |
---|---|---|
Item_ID | ID number assigned to all product items in-store | Integer |
Price | Current price of product item | Integer |
Department | Which department the product item belongs to | Character |
Number_of_Sales | The current number of product items sold | Integer |
Number_in_Stock | The current number of product items in stock | Integer |
Seasonal | Whether or not the product item is only seasonally available | Boolean |
You can use the properties outlined in the dictionary to compare incoming data to the destination table. If any data objects don’t match the entries in the dictionary, then the data validation will flag the error before the incorrect data is ingested.
For example, if incoming data that is being delivered to the Department column contains numerical data, you can quickly identify that there has been an error before it gets delivered because the data dictionary states Department data should be character-type.
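A minimal sketch of that kind of check is shown below: it flags incoming rows whose Department value is purely numeric, which the data dictionary says should never happen. The staging_product table name is a hypothetical placeholder for wherever incoming data lands before the target table, and the REGEXP pattern syntax is MySQL-style, so adjust it for your database.

```sql
-- Data dictionary check sketch: Department should be character data, never a bare number.
-- "staging_product" is a hypothetical staging table; REGEXP syntax is MySQL-style.
SELECT Item_ID, Department
FROM staging_product
WHERE Department REGEXP '^[0-9]+$';
```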
Data lineages
A data lineage describes the process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time. This can be really helpful for BI professionals, because when they do encounter an error, they can actually track it to the source using the lineage. Then, they can implement checks to prevent the same issue from occurring again.
For example, imagine your system flagged an error with some incoming data about the number of sales for a particular item. It can be hard to find where this error occurred if you don’t know the lineage of that particular piece of data, but by following that data’s path through your system, you can figure out where to build a check.
By tracking the sales data through its life cycle in the system, you find that there was an issue with the original database it came from and that data needs to be transformed before it’s ingested into later tables.
Key takeaways
Tools such as data dictionaries and data lineages are useful for preventing inconsistencies as data is moved from source systems to its final destination. It is important that users accessing and using that data can be confident that it is correct and consistent. This depends on the data being delivered into the target systems having already been validated. This is key for building trustworthy reports and dashboards as a BI professional!
Video: Check your schema
In this case study, an educational non-profit is ingesting data from school databases in order to evaluate learning goals, national education statistics, and student surveys.
The non-profit uses a data dictionary and lineage to establish the necessary standards for data consistency. The data dictionary records four specific properties for each column: the name of the column, its definition, the datatype, and possible values. The data lineage includes information about the data’s origin, where it is moved throughout the system, and how it has transformed over time.
The schema validation process flags an error for a piece of data that is not integer type. The data lineage is used to trace the journey of this piece of data and find out where in the process a quality check should be added.
In this case, the data was not typecast correctly when it was input into the school’s original database. The transformation process in the pipeline does not include type casting.
The non-profit can improve their systems and prevent errors by incorporating type casting into the pipeline before data is read into the destination table.
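As a hedged sketch of that fix, the transformation step could explicitly cast the age field before loading it into the destination table. The staging_student_info and student_info table names mirror the case study's columns but are assumptions for illustration only, and the exact CAST target type name can vary by database.

```sql
-- Type casting sketch: cast age to an integer during the load step.
-- Table names are hypothetical; rows that cannot be cast should be routed to an error log instead.
INSERT INTO student_info (student_id, school_system, school, age, grade_point_average)
SELECT
  student_id,
  school_system,
  school,
  CAST(age AS INTEGER) AS age,  -- enforce the integer type the destination schema requires
  grade_point_average
FROM staging_student_info;
```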
Checking your schema in Business Intelligence (BI)
A schema is a set of rules that defines the structure and relationships of data in a database. In BI, it is important to check your schema regularly to ensure that it is accurate and up-to-date. This helps to prevent errors and ensure that data is consistent and reliable.
There are a number of ways to check your schema in BI. These include:
- Using a data profiling tool: A data profiling tool can be used to analyze the structure and contents of your data. This can help you to identify any errors or inconsistencies in your schema.
- Manually reviewing your schema: You can also manually review your schema to check for errors. This is a good way to ensure that your schema is complete and accurate.
- Using a schema validation tool: A schema validation tool can be used to check your schema against a set of rules. This can help you to identify any violations of your schema rules.
- Using a data quality tool: A data quality tool can be used to check the quality of your data. This can help you to identify any errors or inconsistencies in your data that may be caused by problems with your schema.
By using these methods, you can ensure that your schema is accurate and up-to-date. This helps to prevent errors and ensure that data is consistent and reliable.
Here are some additional tips for checking your schema in BI:
- Document your schema: Documenting your schema helps to ensure that everyone who uses your data understands its structure and relationships. This can help to prevent errors and make it easier to maintain your schema.
- Use standard naming conventions: Using standard naming conventions for your data elements helps to make your schema more readable and understandable. This can help to prevent errors and make it easier to maintain your schema.
- Test your schema: Testing your schema helps to ensure that it is working correctly. This can be done by creating test data and running it through your BI system.
- Monitor your schema: Monitoring your schema helps to ensure that it remains accurate and up-to-date. This can be done by tracking changes to your data and updating your schema accordingly.
By following these tips, you can help to ensure that your schema is accurate, up-to-date, and easy to use.
One of the best ways to learn is through a case study. When you witness how something happened at an actual organization, it really brings ideas and concepts to life. In this video, we’re going to check out schema governance in action at an educational non-profit.

In this scenario, decision-makers at the non-profit are interested in measuring educational outcomes in their community. In order to do this, they are ingesting data from school databases in order to evaluate learning goals, national education statistics, and student surveys. Because they’re pulling data from multiple sources into their own database system, it’s important that they maintain consistency across all the data to prevent errors and avoid losing important information. Luckily, this organization already has a data dictionary and lineage in place to establish the necessary standards.

Let’s check out an example of a column from the student information table. This table has five columns: student ID, school system, school, age, and grade point average. Each column in this table has been recorded in the data dictionary to specify what information it contains, so we can go to the data dictionary entry for the school system column to double-check the standards for this table. As a refresher, a data dictionary is a collection of information that describes the content, format, and structure of data objects within a database and their relationships. This dictionary records four specific properties: the name of the column, its definition, the data type, and possible values. The dictionary entry for age lets us know the data objects in this column contain information about a student’s age. It also tells us that this is integer-type data. We can use these properties to compare incoming data to the destination table. If any data objects aren’t integer-type data, then the schema validation will flag the error before the incorrect data is ingested into the destination.

What happens when a data object fails the schema validation process? We can actually use the data lineage to trace the journey of this piece of data and find out where in the process we might want to add a quality check. Again, a data lineage includes information about the data’s origin, where it has moved throughout the system, and how it has transformed over time. During the schema validation process, this piece of data threw an error because it isn’t currently cast as integer type. When we check the data lineage, we can track this object’s movement through our system. This data started in an individual school’s database before being read into the school system’s database. The individual school’s data was ingested by our pipeline along with data from other school systems, and then organized and transformed during the movement process. Apparently, when this data was input in the school’s original database, it wasn’t typecast correctly. We can confirm that by checking its data type throughout the lineage. The lineage also includes all the transformations that this data has undergone so far. Also, we might at this point notice that type casting isn’t built into our transformation process during quality checks. That’s great news. Now we know that’s a process we should incorporate into the pipeline before data is read into the destination table. In this case, age data objects should be integer type, and that’s how schema governance and validation can help improve systems and prevent errors.

There are other tests that should be applied to a pipeline to make sure it’s functioning correctly, which we’ll learn more about soon. But now you have a better understanding of how to validate the schema and keep improving your pipeline processes.
Reading: Schema-validation checklist
Reading
In this course, you have been learning about the tools business intelligence professionals use to ensure conformity from source to destination: schema validation, data dictionaries, and data lineages. In another reading, you already had the opportunity to explore data dictionaries and lineages. In this reading, you are going to get a schema validation checklist you can use to guide your own validation process.
Schema validation is a process used to ensure that the source system data schema matches the target database data schema. This is important because if the schemas don’t align, it can cause system failures that are hard to fix. Building schema validation into your workflow is important to prevent these issues.
Common issues for schema validation
- The keys are still valid: Primary and foreign keys build relationships between tables in relational databases. These keys should continue to function after you have moved data from one system into another (see the sketch after this list).
- The table relationships have been preserved: The keys preserve the relationships used to connect tables. It’s important to make sure that these relationships are preserved, or that they are transformed to match the target schema.
- The conventions are consistent: The conventions for incoming data must be consistent with the target database’s schema. Data from outside sources might use different conventions for naming columns in tables, so it’s important to align these before they’re added to the target system.
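Here is a minimal sketch of a check for the first two items: it looks for fact rows whose customer key no longer matches any customer record after the move. The warehouse.sales and warehouse.customers table names and the customer_id column are assumptions for illustration only.

```sql
-- Orphan key check sketch: find fact rows whose foreign key no longer resolves.
-- Table and column names are hypothetical placeholders.
SELECT s.*
FROM warehouse.sales AS s
LEFT JOIN warehouse.customers AS c
  ON s.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```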
Using data dictionaries and lineages
You’ve already learned quite a bit about data dictionaries and lineages. As a refresher, a data dictionary is a collection of information that describes the content, format, and structure of data objects within a database, as well as their relationships. And a data lineage is the process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time. These tools are useful because they can help you identify what standards incoming data should adhere to and track down any errors to the source.
The data dictionaries and lineages reading provided some additional information if more review is needed.
Key takeaways
Schema validation is a useful check for ensuring that the data moving from source systems to your target database is consistent and won’t cause any errors. Building in checks to make sure that the keys are still valid, the table relationships have been preserved, and the conventions are consistent before data is delivered will save you time and energy trying to fix these errors later on.
Practice Quiz: Activity: Evaluate a schema using a validation checklist
Reading: Activity Exemplar: Evaluate a schema using a validation checklist
Reading
Completed Exemplar
To review the exemplar for this course item, click the following link and select Use Template.
Link to exemplar: Database schema exemplar
Assessment of Exemplar
Compare the exemplar to your completed activity. Review your work using each of the criteria in the exemplar. What did you do well? Where can you improve? Use your answers to these questions to guide you as you continue to progress through the course.
In the schema you evaluated in this activity, the Sales Fact table is a central table that contains key figures from the transactions. It also contains an internal key to the dimension it’s linked to. This is a common schema structure in BI data warehouse systems.
The original schema contains eight tables: Sales Fact, Shipments, Billing, Order Items, Product, Product Price, Order Details, and Customer, which are connected via keys.
The central table is Sales Fact. The foreign keys in the Sales Fact table link to the other tables as follows:
- “order_sid” key links to the Order Items, Order Details, Shipments, and Billing tables
- “customer_sid” links to Order Details; “order_item_sid” links to Order Items, Shipments, and Billing
- “shipment_sid” links to Shipments; and “billing_sid” links to Billing
- “product_id” from the Product table links to Order Items and Product Price
The Customer table currently doesn’t have any links to other tables. It contains the following columns: “customer_sid,” “customer_name,” and “customer_type.”
This schema chart includes the following problems:
- The Customer table is not linked to any other tables. It should be linked to Sales Facts and Order Details tables. This violates the “Keys are still valid” and “Table relationships have been preserved” checks.
- The Shipments table should be connected to the Order Items table through the “order_sid” dimension. This violates the “Keys are still valid” and “Table relationships have been preserved” checks.
The exemplar in this reading is an example of the schema you evaluated, but with its errors fixed. It links the Customer table to the Sales Facts and Order Details tables through the customer_sid dimension. It has a connection between the Shipment and Order Items tables. It also has consistent naming conventions for “product_sid.” Consistent naming for column titles is not mandatory, but it is a best practice to keep titles as consistent as possible.
The important dimensions that are connections in this schema are order_sid, order_item_sid, customer_sid, product_sid, shipment_sid, and billing_sid.
- Order_sid is present in the Sales Facts, Order Items, Shipments, Billing, and Order Details tables.
- Order_item_sid is present in the Sales Facts, Order Items, Shipments, and Billing tables.
- Customer_sid is present in the Sales Facts, Order Details, and Customer tables.
- Product_sid is present in the Order Items, Product, and Product Price tables.
- Shipment_sid is present in the Sales Facts and Shipment tables.
- Billing_sid is present in the Sales Facts and Billing tables.
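If you wanted to automate part of this checklist, a short script could confirm that each connecting dimension appears in every table that should contain it. The mapping below simply restates the exemplar described above; the script itself is an illustrative sketch that only checks column presence, not the join relationships, and is not part of the activity.

```python
# Sketch: confirm each connecting dimension appears in its expected tables.
# The expected placement restates the exemplar schema described above.
expected_placement = {
    "order_sid": {"Sales Facts", "Order Items", "Shipments", "Billing", "Order Details"},
    "order_item_sid": {"Sales Facts", "Order Items", "Shipments", "Billing"},
    "customer_sid": {"Sales Facts", "Order Details", "Customer"},
    "product_sid": {"Order Items", "Product", "Product Price"},
    "shipment_sid": {"Sales Facts", "Shipments"},
    "billing_sid": {"Sales Facts", "Billing"},
}

# Columns actually found in the schema being evaluated (example input).
actual_columns = {
    "Sales Facts": {"order_sid", "order_item_sid", "customer_sid", "shipment_sid", "billing_sid"},
    "Order Items": {"order_sid", "order_item_sid", "product_sid"},
    "Order Details": {"order_sid", "customer_sid"},
    "Shipments": {"order_sid", "order_item_sid", "shipment_sid"},
    "Billing": {"order_sid", "order_item_sid", "billing_sid"},
    "Customer": {"customer_sid", "customer_name", "customer_type"},
    "Product": {"product_sid"},
    "Product Price": {"product_sid"},
}

problems = []
for key, tables in expected_placement.items():
    for table in tables:
        if key not in actual_columns.get(table, set()):
            problems.append(f"'{key}' is missing from the '{table}' table")

print(problems if problems else "All connecting dimensions are where the exemplar expects them")
```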
Practice Quiz: Test your knowledge: Data schema validation
A team of business intelligence professionals builds schema validation into their workflows. In this situation, what goal do they want to achieve?
Ensure the source system data schema matches the target system data schema
They want to ensure the source system data schema matches the target system data schema.
Why is it important to ensure primary and foreign keys continue to function after data has been moved from one database system to another?
To preserve the existing table relationships
It is important to ensure primary and foreign keys continue to function after data has been moved from one database system to another in order to preserve the existing table relationships.
Fill in the blank: A _ describes the process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time.
data lineage
A data lineage describes the process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time.
Business rules and performance testing
Video: Verify business rules
Business rules are a critical aspect of ensuring that databases meet the needs of a business. By verifying that data complies with business rules, BI professionals can ensure that databases are performing as intended. Business rules are different for every organization and can change over time, so it is important to keep a record of what rules exist and why. Verifying business rules is similar to schema validation, but it involves comparing incoming data to the business rules of the organization. By verifying business rules, BI professionals can help ensure that databases are providing accurate and relevant information to stakeholders.
Verifying Business Rules in Business Intelligence
Business rules are statements that define or constrain some aspect of a business. In the context of business intelligence (BI), business rules are used to ensure that data is accurate, consistent, and compliant with the organization’s policies and procedures.
There are a number of ways to verify business rules in BI. One common approach is to use a business rule repository. A business rule repository is a central storehouse for all of the business rules that are used by an organization. The repository can be used to store the rules themselves, as well as metadata about the rules, such as the rule owner, the date the rule was created, and the rule’s status.
Another approach to verifying business rules is to use data quality tools. Data quality tools can be used to identify and correct data errors. They can also be used to ensure that data is consistent with business rules.
For example, a data quality tool could be used to identify duplicate records in a customer database. The tool could then be used to merge the duplicate records into a single record. This would ensure that the customer database is accurate and consistent.
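The exact behavior depends on the tool, but the underlying logic resembles the following minimal Python sketch. The field names and the keep-the-newest-value merge policy are illustrative assumptions, not a specific product's API.

```python
# Sketch: detect duplicate customer records by email and merge them,
# keeping the most recently updated non-null value for each field.
from collections import defaultdict

customers = [
    {"id": 1, "email": "ana@example.com", "phone": None, "updated": "2023-01-05"},
    {"id": 2, "email": "ana@example.com", "phone": "555-0101", "updated": "2023-03-10"},
    {"id": 3, "email": "li@example.com", "phone": "555-0102", "updated": "2023-02-01"},
]

# Group records that share the same email address.
by_email = defaultdict(list)
for record in customers:
    by_email[record["email"]].append(record)

merged = []
for email, records in by_email.items():
    # Sort duplicates oldest-to-newest, then overlay non-null values so the
    # newest non-null value for each field wins.
    records.sort(key=lambda r: r["updated"])
    combined = {}
    for record in records:
        combined.update({k: v for k, v in record.items() if v is not None})
    merged.append(combined)

print(merged)  # one record per unique email address
```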
In addition to using data quality tools, BI professionals can also verify business rules by manually reviewing data. This can be a time-consuming process, but it can be necessary to ensure that data is accurate and compliant with business rules.
Benefits of verifying business rules
There are a number of benefits to verifying business rules in BI. These benefits include:
- Improved data quality: By verifying business rules, BI professionals can ensure that data is accurate, consistent, and compliant with the organization’s policies and procedures.
- Reduced risk: By identifying and correcting data errors, BI professionals can reduce the risk of making bad decisions based on inaccurate data.
- Increased efficiency: By ensuring that data is consistent, BI professionals can make it easier for users to find and use data.
- Improved compliance: By verifying that data is compliant with business rules, BI professionals can help organizations meet their compliance obligations.
How to verify business rules
The following are some steps that BI professionals can take to verify business rules:
- Identify the business rules that need to be verified. This can be done by reviewing the organization’s policies and procedures, as well as by interviewing stakeholders.
- Determine how the business rules will be verified. This could involve using a business rule repository, data quality tools, or manual review.
- Gather the data that needs to be verified. This could involve extracting data from operational systems, or it could involve collecting data from users.
- Verify the data against the business rules. This could involve using automated tools or manual review.
- Document the results of the verification process. This could involve creating a report that summarizes the findings of the verification process.
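To make the last two steps above concrete (verifying the data against the rules and documenting the results), here is a minimal sketch that expresses business rules as small check functions, verifies a batch of records against them, and records the outcome. The rules and field names are hypothetical examples, not a prescribed implementation.

```python
# Sketch: verify records against business rules and document the results.
from datetime import date

# Hypothetical business rules; each returns True when a record complies.
rules = {
    "ship date not before order date":
        lambda r: r["ship_date"] >= r["order_date"],
    "amount must be positive":
        lambda r: r["amount"] > 0,
}

records = [
    {"id": 1, "order_date": date(2024, 5, 1), "ship_date": date(2024, 5, 3), "amount": 120.0},
    {"id": 2, "order_date": date(2024, 5, 4), "ship_date": date(2024, 5, 2), "amount": 80.0},
]

# Verify each record and build a simple verification report.
report = []
for record in records:
    failed = [name for name, check in rules.items() if not check(record)]
    report.append({"record_id": record["id"], "passed": not failed, "failed_rules": failed})

# Document the results of the verification process.
for line in report:
    print(line)
```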
Conclusion
Verifying business rules is an important part of ensuring that data is accurate, consistent, and compliant with the organization’s policies and procedures. By taking the steps outlined above, BI professionals can help ensure that their organizations are getting the most out of their data.
Fill in the blank: A business rule is a statement that creates a _____ on specific parts of a database.
restriction
A business rule is a statement that creates a restriction on specific parts of a database. It helps prevent errors within the system.
So far, we’ve learned a lot about
database performance, quality testing and schema validation and
how these checks ensure the database and pipeline system continue
to work as expected. Now we’re going to explore another
important check, making sure that the systems and processes you have
created actually meet business needs. This is essential for ensuring that those systems continue
to be relevant to your stakeholders. To do this, BI professionals verify business rules. In BI, a business rule is a statement that
creates a restriction on specific parts of a database. For example, a shipping database might
impose a business rule that states shipping dates can’t
come before order dates. This prevents order dates and
shipping dates from being mixed up and causing errors within this system. Business rules are created according to the way a particular organization uses its data. In a previous video, we discovered how important it can be to observe how a business uses data before building a database system. Understanding the actual needs guides the design, and this is true for business rules too. The business rules you create will affect a lot of the database's design: what data is collected and stored, how relationships are defined, what kind of information the database provides, and the security of the data. This helps ensure the database
is performing as intended. Business rules are different in
every organization because the way organizations interact with
their data is always different. Plus, business rules are always changing, which is why keeping a record of what rules exist and why is critical. Here's another example: consider a library database. The primary need of the users, who are librarians in this case, is to check out books and maintain information about patrons. Because of this, there are a few business rules this
library might impose on the database to regulate the system. One rule could be that library patrons
cannot check out more than five books at a time. The database won’t let a user check out
a sixth book. Or the database could have a rule that the same book cannot be checked
out by two people at the same time. If someone tries, the librarians would
be alerted that there’s a redundancy. Another business rule could be that
specific information must be entered into the system for a new book to
be added to the library inventory. Basically verification involves ensuring
that data imported into the target database complies with
business rules on top of that. These rules are important pieces
of knowledge that help a BI professional understand how a business and its processes function. This helps the BI professional become a subject matter expert and trusted advisor. As you’re probably noticing this process
is very similar to schema validation. In schema validation, you take the target database’s schema and
compare incoming data to it. Data that fails this check is not ingested
into the destination database. Similarly, you will compare incoming data
to the business rules before loading it into the database. In our library example,
if a patron puts in a request for a book but they already have five books checked out from the library, then this incoming data doesn't comply with the preset business rule, and it prevents them from checking out the book. And that's
the basics about verifying business rules. These checks are important because they
ensure that databases do their jobs as intended. And because business rules are
so integral to the way databases function, verifying that they’re working
correctly is very important. Coming up, you’ll get a chance to
explore business rules in more detail.
Reading: Business rules
Reading
As you have been learning, a business rule is a statement that creates a restriction on specific parts of a database. These rules are developed according to the way an organization uses data. The rules create efficiencies, allow for important checks and balances, and sometimes exemplify the values of a business in action. For instance, if a company values cross-functional collaboration, there may be a rule that representatives from at least two teams must check off completion of a dataset. Business rules affect what data is collected and stored, how relationships are defined, what kind of information the database provides, and the security of the data. In this reading, you will learn more about the development of business rules and see an example of business rules being implemented in a database system.
Imposing business rules
Business rules are highly dependent on the organization and their data needs. This means business rules are different for every organization. This is one of the reasons why verifying business rules is so important; these checks help ensure that the database is actually doing the job you need it to do. But before you can verify business rules, you have to implement them.
For example, let's say the company you work for has a database that manages purchase order requests entered by employees. Purchase orders over $1,000 need manager approval. In order to automate this process, you can impose a ruleset on the database that automatically delivers requests over $1,000 to a reporting table pending manager approval. Other business rules that may apply in this example are that prices must be numeric values (the data type should be integer) and that, for a request to exist, a reason is mandatory (the table field may not be null).
In order to fulfill this business requirement, there are three rules at play in this system:
- Order requests under $1,000 are automatically delivered to the approved product order requests table
- Requests over $1,000 are automatically delivered to the requests pending approval table
- Approved requests are automatically delivered to the approved product order requests table
These rules inherently affect the shape of this database system to cater to the needs of this particular organization.
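A very simplified sketch of that ruleset might look like the following. The $1,000 threshold, the numeric-price rule, and the mandatory-reason rule come from the scenario above; the table and field names are illustrative assumptions.

```python
# Sketch: route purchase order requests according to the business rules above.
approved_orders = []    # "approved product order requests" table (illustrative)
pending_approval = []   # "requests pending approval" table (illustrative)
rejected = []           # requests that violate the basic data rules

def route_request(request):
    # Rule: prices must be numeric values.
    if not isinstance(request.get("price"), (int, float)):
        rejected.append({**request, "error": "price must be numeric"})
        return
    # Rule: for a request to exist, a reason is mandatory.
    if not request.get("reason"):
        rejected.append({**request, "error": "reason is mandatory"})
        return
    # Rule: requests over $1,000 need manager approval.
    if request["price"] > 1000:
        pending_approval.append(request)
    else:
        approved_orders.append(request)

route_request({"id": 101, "price": 250.00, "reason": "replacement monitor"})
route_request({"id": 102, "price": 4999.99, "reason": "conference booth"})
route_request({"id": 103, "price": "free", "reason": "sample kit"})

print(len(approved_orders), "approved,", len(pending_approval), "pending,", len(rejected), "rejected")
```

A real implementation would more likely enforce these rules in the database or the ETL layer itself, for example with constraints or automated routing logic, but the decision flow is the same.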
Verifying business rules
Once the business rules have been implemented, it’s important to continue to verify that they are functioning correctly and that data being imported into the target systems follows these rules. These checks are important because they test that the system is doing the job it needs to, which in this case is delivering product order requests that need approval to the right stakeholders.
Key takeaways
Business rules determine what data is collected and stored, how relationships are defined, what kind of information the database provides, and the security of the data. These rules heavily influence how a database is designed and how it functions after it has been set up. Understanding business rules and why they are important is useful as a BI professional because this can help you understand how existing database systems are functioning, design new systems according to business needs, and maintain them to be useful in the future.
Reading: Database performance testing in an ETL context
Reading
In previous lessons, you learned about database optimization as part of the database building process. But it’s also an important consideration when it comes to ensuring your ETL and pipeline processes are functioning properly. In this reading, you are going to return to database performance testing in a new context: ETL processes.
How database performance affects your pipeline
Database performance is the rate that a database system is able to provide information to users. Optimizing how quickly the database can perform tasks for users helps your team get what they need from the system and draw insights from the data that much faster.
Your database systems are a key part of your ETL pipeline– these include where the data in your pipeline comes from and where it goes. The ETL or pipeline is a user itself, making requests of the database that it has to fulfill while managing the load of other users and transactions. So database performance is not just key to making sure the database itself can manage your organization’s needs– it’s also important for the automated BI tools you set up to interact with the database.
Key factors in performance testing
Earlier, you learned about some database performance considerations you can check for when a database starts slowing down. Here is a quick checklist of those considerations:
- Queries need to be optimized
- The database needs to be fully indexed
- Data should be defragmented
- There must be enough CPU and memory for the system to process requests
You also learned about the five factors of database performance: workload, throughput, resources, optimization, and contention. These factors all influence how well a database is performing, and it can be part of a BI professional’s job to monitor these factors and make improvements to the system as needed.
These general performance tests are really important– that's how you know your database can handle data requests for your organization without any problems! But when it comes to database performance testing while considering your ETL process, there is another important check you should make: testing the table, column, and row counts, as well as the query execution plan.
Testing the row and table counts allows you to make sure that the data count matches between the target and source databases. If there are any mismatches, that could mean that there is a potential bug within the ETL system. A bug in the system could cause crashes or errors in the data, so checking the number of tables, columns, and rows of the data in the destination database against the source data can be a useful way to prevent that.
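A count check like this can be scripted directly against the two systems. The sketch below uses an in-memory SQLite database purely to keep the example self-contained; the table list and connection details would differ for a real source and target warehouse.

```python
# Sketch: compare row counts per table between a source and a target database.
import sqlite3

def row_counts(connection, tables):
    """Return a dict of table name -> row count for the given connection."""
    counts = {}
    for table in tables:
        # Table names come from a trusted list, not user input.
        counts[table] = connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts

# In-memory databases stand in for the real source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE orders (order_id INTEGER)")
source.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
target.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])  # one record lost in transit

tables = ["orders"]
source_counts = row_counts(source, tables)
target_counts = row_counts(target, tables)
for table in tables:
    if source_counts[table] != target_counts[table]:
        print(f"Mismatch in {table}: source={source_counts[table]}, target={target_counts[table]}")
```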
Key takeaways
As a BI professional, you need to know that your database can meet your organization’s needs. Performance testing is a key part of the process. Not only is performance testing useful during database building itself, but it’s also important for ensuring that your pipelines are working properly as well. Remembering to include performance testing as a way to check your pipelines will help you maintain the automated processes that make data accessible to users!
Upgraded Plugin: Evaluate: Performance test your data pipeline
Reading
Reading: Defend against known issues
Reading
In this reading, you’ll learn about a defensive check applied to a data pipeline. Defensive checks help you prevent problems in your data pipeline. They are similar to performance checks but focus on other kinds of problems. The following scenario will provide an example of how you can implement different kinds of defensive checks on a data pipeline.
Scenario
Arsha, a Business Intelligence Analyst at a telecommunications company, built a data pipeline that merges data from six sources into a single database. While building her pipeline, she incorporated several defensive checks that ensured that the data was moved and transformed properly.
Her data pipeline used the following source systems:
- Customer details
- Mobile contracts
- Internet and cable contracts
- Device tracking and enablement
- Billing
- Accounting
All of these datasets had to be harmonized and merged into one target system for business intelligence analytics. This process required several layers of data harmonization, validation, reconciliation, and error handling.
Pipeline layers
Pipelines can have many different stages of processing. These stages, or layers, help ensure that the data is collected, aggregated, transformed, and staged in the most effective and efficient way. For example, it’s important to make sure you have all the data you need in one place before you start cleaning it to ensure that you don’t miss anything. There are usually four layers to this process: staging, harmonization, validation, and reconciliation. After these four layers, the data is brought into its target database and an error handling report summarizes each step of the process.
Staging layer
First, the original data is brought from the source systems and stored in the staging layer. In this layer, Arsha ran the following defensive checks:
- Compared the number of records received and stored
- Compared rows to identify if extra records were created or records were lost
- Checked important fields, such as amounts, dates, and IDs
Arsha moved the mismatched records to the error handling report. She included each unconverted source record, the date and time of its first processing, its last retry date and time, the layer where the error happened, and a message describing the error. By collecting these records, Arsha was able to find and fix the origin of the problems. She marked all of the records that moved to the next layer as “processed.”
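A stripped-down version of these staging checks might look like the sketch below. The record structure and error-handling fields mirror the description above, but the data, counts, and field names are made up for illustration.

```python
# Sketch: staging-layer defensive checks with an error-handling collection.
from datetime import datetime, timezone

source_records = [
    {"id": "0001", "amount": 120.0, "date": "2024-05-01"},
    {"id": "0002", "amount": None, "date": "2024-05-02"},   # missing amount
    {"id": "0003", "amount": 75.5, "date": "2024-05-03"},
]
expected_count = 3

staged, error_handling = [], []

# Check: compare the number of records received and stored.
if len(source_records) != expected_count:
    print(f"Record count mismatch: expected {expected_count}, got {len(source_records)}")

# Check: important fields (amounts, dates, IDs) must be present.
for record in source_records:
    problems = [field for field in ("id", "amount", "date") if record.get(field) is None]
    if problems:
        error_handling.append({
            "source_record": record,
            "first_processed": datetime.now(timezone.utc).isoformat(),
            "last_retry": None,
            "layer": "staging",
            "error": f"missing fields: {problems}",
        })
    else:
        staged.append({**record, "status": "processed"})

print(len(staged), "records staged;", len(error_handling), "sent to error handling")
```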
Harmonization layer
The harmonization layer is where data normalization routines and record enrichment are performed. This ensures that data formatting is consistent across all the sources. To harmonize the data, Arsha ran the following defensive checks:
- Standardized the date format
- Standardized the currency
- Standardized uppercase and lowercase stylization
- Formatted IDs with leading zeros
- Split date values to store the year, month, and day in separate columns
- Applied conversion and priority rules from the source systems
When a record couldn’t be harmonized, she moved it to Error Handling. She marked all of the records that moved to the next layer as “processed.”
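Here is a minimal sketch of a few of those harmonization steps. The incoming date format, ID length, and field names are assumptions for illustration only.

```python
# Sketch: harmonization-layer transformations on a single record.
from datetime import datetime

def harmonize(record):
    # Standardize the date format (assumes source dates arrive as DD/MM/YYYY).
    parsed = datetime.strptime(record["contract_date"], "%d/%m/%Y")
    # Split date values into separate year, month, and day columns.
    record["year"], record["month"], record["day"] = parsed.year, parsed.month, parsed.day
    record["contract_date"] = parsed.strftime("%Y-%m-%d")
    # Standardize uppercase and lowercase stylization.
    record["customer_name"] = record["customer_name"].title()
    # Format IDs with leading zeros (assumes 8-character IDs).
    record["customer_id"] = str(record["customer_id"]).zfill(8)
    return record

print(harmonize({
    "customer_id": 4521,
    "customer_name": "ACME TELECOM",
    "contract_date": "07/03/2024",
}))
```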
Validations layer
The validations layer is where business rules are validated. As a reminder, a business rule is a statement that creates a restriction on specific parts of a database. These rules are developed according to the way an organization uses data. Arsha ran the following defensive checks:
- Ensured that values in the “department” column were not null, since “department” is a crucial dimension
- Ensured that values in the “service type” column were within the authorized values to be processed
- Ensured that each billing record corresponded to a valid processed contract
Again, when a record couldn’t be validated, she moved it to error handling. She marked all the records that moved to the next layer as “processed.”
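In code, those validations can be expressed as simple predicates, as in this sketch. The authorized values, contract IDs, and record fields are illustrative assumptions.

```python
# Sketch: validations-layer business rule checks.
AUTHORIZED_SERVICE_TYPES = {"mobile", "internet", "cable"}
processed_contract_ids = {"C-100", "C-101"}  # contracts already processed upstream

def validate(record):
    errors = []
    # "department" is a crucial dimension, so it must not be null.
    if not record.get("department"):
        errors.append("department is null")
    # "service type" must be within the authorized values.
    if record.get("service_type") not in AUTHORIZED_SERVICE_TYPES:
        errors.append(f"unauthorized service type: {record.get('service_type')}")
    # Each billing record must correspond to a valid processed contract.
    if record.get("contract_id") not in processed_contract_ids:
        errors.append(f"no processed contract: {record.get('contract_id')}")
    return errors

print(validate({"department": "consumer", "service_type": "mobile", "contract_id": "C-100"}))
print(validate({"department": None, "service_type": "satellite", "contract_id": "C-999"}))
```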
Reconciliation layer
The reconciliation layer is where duplicate or illegitimate records are found. Here, Arsha ran defensive checks to find the following types of records:
- Slow-changing dimensions
- Historic records
- Aggregations
As with the previous layers, Arsha moved the records that didn’t pass the reconciliation rules to Error Handling. After this round of defensive checks, she brought the processed records into the BI and Analytics database (OLAP).
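A very reduced sketch of the duplicate-detection part of this layer might look like the following; the business key used to spot duplicates is an assumption for illustration.

```python
# Sketch: reconciliation-layer check for duplicate records before loading.
validated_records = [
    {"contract_id": "C-100", "billing_period": "2024-04", "amount": 49.99},
    {"contract_id": "C-100", "billing_period": "2024-04", "amount": 49.99},  # duplicate
    {"contract_id": "C-101", "billing_period": "2024-04", "amount": 89.99},
]

seen_keys = set()
to_load, error_handling = [], []

for record in validated_records:
    # A record is treated as a duplicate if its business key was already seen.
    key = (record["contract_id"], record["billing_period"])
    if key in seen_keys:
        error_handling.append({**record, "layer": "reconciliation", "error": "duplicate record"})
    else:
        seen_keys.add(key)
        to_load.append(record)

print(len(to_load), "records ready for the OLAP database;", len(error_handling), "duplicates flagged")
```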
Error handling reporting and analysis
After completing the pipeline and running the defensive checks, Arsha made an error handling report to summarize the process. The report listed the number of records from the source systems, as well as how many records were marked as errors or ignored in each layer. The end of the report listed the final number of processed records.
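If each rejected record carries the layer where it failed, the summary portion of that report can be generated with a simple count, as in this sketch (the numbers are made up):

```python
# Sketch: summarize an error-handling report by pipeline layer.
from collections import Counter

error_handling = [
    {"layer": "staging", "error": "missing amount"},
    {"layer": "harmonization", "error": "unparseable date"},
    {"layer": "validations", "error": "department is null"},
    {"layer": "validations", "error": "unauthorized service type"},
]
source_record_count = 1000
loaded_record_count = source_record_count - len(error_handling)

print("Records received from source systems:", source_record_count)
for layer, count in Counter(e["layer"] for e in error_handling).items():
    print(f"Records rejected in {layer}: {count}")
print("Final number of processed records:", loaded_record_count)
```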
Key takeaways
Defensive checks are what ensure that a data pipeline properly handles its data. Defensive checks are an essential part of preserving data integrity. Once the staging, harmonization, validations, and reconciliation layers have been checked, the data brought into the target database is ready to be used in a visualization.
Video: Burak: Evolving technology
Burak is a BI Engineer at Google. He moved to the US five years ago and had a hard time getting his first job. He realized that he needed to learn new skills to meet the demands of the US market, so he started teaching himself what a BI Engineer does and what technical skills he needed.
He found a lot of free and paid resources online and learned the skills he needed to get his first job. He emphasizes that BI technology has evolved a lot in the past 10 years, and that spreadsheets are no longer the go-to tool for data analysis. Instead, cloud technologies, more sophisticated languages, and more powerful visualization tools are now being used.
Burak believes that it is important to keep up with the latest technology trends in order to be successful in BI engineering. He recommends following online resources, web pages, and newsletters to stay informed. However, he also emphasizes that the essentials of BI engineering remain the same. Once you learn a SQL language, the skills will be transferable to other organizations, even if the specific SQL dialect is different.
Burak’s number one piece of advice is to be dedicated. BI engineering requires a lot of technical skills, so it is important to be willing to put in the time and effort to learn them. He also recommends identifying which area of BI engineering you want to focus on, as there are several different areas that you can work in. The technical skills you need to learn will vary depending on which area you choose.
My name is Burak. I’m a BI Engineer in Google. What a BI Engineer does
is actually collecting, storing, and analyzing data with a lot of
infrastructure involved. I moved to the States five years ago as a fresh immigrant. It was very difficult for me to get that first job. Even though I have a degree, I still needed a lot of training to get ready for the domestic needs of the market. I started teaching myself
what a BI engineer does and what the essential technical skills are that I need to learn. I did a lot of online research. I found a lot of free stuff, I found a couple of paid subscriptions, and
I started learning all these technical skills that I need to learn to
get my first job. BI technology evolved a lot
over the last 10 years. I can say spreadsheets were very popular 10 years ago, but nobody specializes in them anymore in terms of data analysis. I mean, it's still being
used for quick analysis, but now everything has
evolved into Cloud, technologies, more
sophisticated languages, more powerful visual
tools to help everyone. Personally, I had to adapt to changing environments and new languages and
new infrastructures. In order to keep up
with the technology, you actually need to do a
little bit research and understand how the industry
is actually evolving. There are lots of
online resources or web pages or
newsletters that you can follow about the
latest industrial trends. But the essentials,
they never change. The foundation of the BI
engineering is the same. If you learn a SQL language, whatever organization you work for is going to use a different SQL language, different dialects. You still need to catch up, but the amount of time that you need to learn is going to be drastically less than when you first start, because there are going to be transferable skills. My number one piece of advice
will be dedication. If you really love
working with data and if you would like to
work in this field, dedication is one of the key requirements
because there is a lot of technical
skills that you need to learn and it needs time. Therefore, I
recommend identifying which area of BI engineering you actually want to focus on, because there are several areas you can work in. You can build infrastructures, or you can design systems, or you can analyze data, or you can work with some
visualization tools. Depending on which area you
would like to focus on, the technical skills you
need to learn will vary.
Reading: Case study: FeatureBase, Part 2: Alternative solutions to pipeline systems
Practice Quiz: Test your knowledge: Business rules and performance testing
A business intelligence professional considers what data is collected and stored in a database, how relationships are defined, the type of information the database provides, and the security of the data. What does this scenario describe?
Considering the impact of business rules
This scenario describes establishing business rules. A business rule is a statement that creates a restriction on specific parts of a database. It helps determine if a database is performing as intended.
At which point in the data-transfer process should incoming data be compared to business rules?
Before loading it into the database
During the data-transfer process, incoming data should be compared to business rules before loading it into the database.
Review: Optimize ETL processes
Video: Wrap-up
As a BI professional, it is important to ensure that the database systems and pipeline tools you build for your organization continue to work as intended and handle potential errors before they become problems. This involves:
- Quality testing in an ETL system to check incoming data for completeness, consistency, conformity, accuracy, redundancy, integrity, and timeliness.
- Schema governance and schema validation to prevent incoming data from causing errors in the system by making sure it conforms to the schema properties of the destination database.
- Verifying business rules to ensure that the incoming data meets the business needs of the organization using it.
- Database optimization to ensure that the systems that users interact with are meeting the business’s needs.
- Optimizing pipelines and ETL systems to ensure that the systems that move data from place to place are as efficient as possible.
The instructor encourages learners to continue learning and to review the material before the next assessment. They also remind learners that they will have the chance to put everything they have been learning into practice by developing BI tools and processes themselves.
As a BI professional, your job doesn't end
once you’ve built the database systems and pipeline tools for your organization. It’s also important that you ensure
they continue to work as intended and handle potential errors
before they become problems. In order to address those ongoing needs, you've been learning a lot. First, you explored the importance
of quality testing in an ETL system. This involved checking incoming data for
completeness, consistency, conformity, accuracy, redundancy,
integrity, and timeliness. You also investigated
schema governance and how schema validation can prevent incoming
data from causing errors in the system by making sure it conforms to the schema
properties of the destination database. After that, you discovered why verifying
business rules is an important step in optimization because it ensures that
the data coming in meets the business needs of the organization using it. Maintaining the storage systems that
users interact with is an important part of ensuring that your system is
meeting the business’s needs. This is why database optimization is so
important, but it’s just as important to ensure that
the systems that move data from place to place are as efficient as possible. And that’s where optimizing pipelines and
ETL systems comes in. Coming up, you have another assessment. I know you can do this, and just as a reminder, you can review any of the material as you get ready, as well as the latest glossary. So feel free to revisit any videos or readings to get a refresher
before the assessment. After that, you’ll have the chance to
put everything you’ve been learning into practice by developing BI tools and
processes yourself. You’re making excellent
progress toward a career in BI.
Reading: Glossary terms from module 3
Reading
Accuracy: An element of quality testing used to confirm that data conforms to the actual entity being measured or described
Business rule: A statement that creates a restriction on specific parts of a database
Completeness: An element of quality testing used to confirm that data contains all desired components or measures
Conformity: An element of quality testing used to confirm that data fits the required destination format
Consistency: An element of quality testing used to confirm that data is compatible and in agreement across all systems
Data dictionary: A collection of information that describes the content, format, and structure of data objects within a database, as well as their relationships
Data lineage: The process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time
Data mapping: The process of matching fields from one data source to another
Integrity: An element of quality testing used to confirm that data is accurate, complete, consistent, and trustworthy throughout its life cycle
Quality testing: The process of checking data for defects in order to prevent system failures; it involves the seven validation elements of completeness, consistency, conformity, accuracy, redundancy, integrity, and timeliness
Redundancy: An element of quality testing used to confirm that no more data than necessary is moved, transformed, or stored
Schema validation: A process to ensure that the source system data schema matches the target database data schema
Timeliness: An element of quality testing used to confirm that data is current
Quiz: Module 3 challenge
Fill in the blank: Quality testing is the process of checking data for _ in order to prevent system failures.
defects
AI assistant
The answer is defects.
Quality testing is the process of checking data for defects in order to prevent system failures. Defects in data can include errors in accuracy, consistency, completeness, and timeliness. If these defects are not identified and corrected, they can lead to system failures, which can disrupt business operations and cause financial losses.
Scalability is the ability of a system to handle increasing amounts of data and users. Links are the connections between different pieces of data. Granularity is the level of detail in data.
None of these other options are directly related to the purpose of quality testing, which is to prevent system failures.
A business intelligence professional is confirming that their data is compatible and in agreement across all systems. Which quality testing validation element does this involve?
Consistency
AI assistant
The correct answer is consistency.
Consistency is the quality testing validation element that ensures that data is consistent across all systems. This means that the same data should have the same value and format in all systems. For example, a customer’s name should be spelled the same way in the CRM system, the ERP system, and the marketing automation system.
The other answer choices are not as accurate:
- Integrity: Integrity ensures that data is accurate and complete.
- Accuracy: Accuracy ensures that data is correct and reflects the real world.
- Completeness: Completeness ensures that all of the necessary data is present.
Consistency is important for business intelligence because it allows businesses to make accurate and reliable decisions. When data is consistent, businesses can be confident that they are looking at the same picture across all of their systems. This can help businesses to identify trends, patterns, and opportunities that they might otherwise miss.
Here are some examples of consistency checks that a business intelligence professional might perform:
- Checking that the same customer ID is used in all systems.
- Checking that the same product code is used in all systems.
- Checking that the same date format is used in all systems.
- Checking that the same currency code is used in all systems.
- Checking that the same measurement units are used in all systems.
By performing these consistency checks, business intelligence professionals can help to ensure that their data is reliable and accurate.
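For instance, a quick cross-system check on customer IDs could look like this sketch; the system names and IDs are made up for illustration.

```python
# Sketch: consistency check -- the same customer IDs should exist in every system.
crm_ids = {"00017", "00018", "00019"}
erp_ids = {"00017", "00019"}
marketing_ids = {"00017", "00018", "00019", "00020"}

all_ids = crm_ids | erp_ids | marketing_ids
for system_name, ids in [("CRM", crm_ids), ("ERP", erp_ids), ("marketing", marketing_ids)]:
    missing = all_ids - ids
    if missing:
        print(f"{system_name} is missing customer IDs: {sorted(missing)}")
```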
A data warehouse is supposed to contain weekly data, but it does not update properly. As a result, the pipeline fails to ingest the latest information. What aspect of the data is being affected in this situation?
Timeliness
AI assistant
The answer is timeliness.
Timeliness refers to the availability of data within a specified time frame. In this case, the data warehouse is supposed to contain weekly data, but it is not updating properly, which means that the data is not timely.
The other answer choices are not as accurate:
- Conformity: Conformity refers to the consistency of data with a predetermined schema.
- Maturity: Maturity refers to the level of development of a data warehouse.
- Redundancy: Redundancy refers to the duplication of data within a data warehouse.
Timeliness is important for data warehouses because it allows businesses to make decisions based on the most up-to-date information. When data is not timely, businesses may not be able to identify trends and patterns accurately. This can lead to poor decision-making and missed opportunities.
In this specific situation, the pipeline is failing to ingest the latest information because the data warehouse is not being updated properly. This means that the data in the pipeline is not timely and cannot be used to make accurate decisions.
To resolve this issue, the business intelligence team would need to investigate why the data warehouse is not updating properly and fix the problem. Once the data warehouse is updating properly, the pipeline will be able to ingest the latest information and businesses will be able to make decisions based on the most up-to-date data.
Business intelligence professionals use schema validation, data dictionaries, and data lineages while establishing consistent data governance. Which aspect of data validation does this involve?
Conformity
AI assistant
The aspect of data validation that schema validation, data dictionaries, and data lineages involve is conformity.
Schema validation is the process of ensuring that data conforms to a predefined schema. A schema is a set of rules that define the structure of data. Schema validation can be used to ensure that data is complete, accurate, and consistent.
Data dictionaries are documents that describe the data in a database or data warehouse. Data dictionaries can be used to identify the meaning of data elements, the data types of data elements, and the relationships between data elements.
Data lineages are maps that show the flow of data through a system. Data lineages can be used to track the origin of data, the transformations that data undergoes, and the destination of data.
Conformity refers to the consistency of data with a predetermined schema. In the context of data governance, conformity ensures that data is consistent across all systems and that it meets the business needs of the organization.
The other answer choices are not as accurate:
- Quality refers to the overall quality of data, including its accuracy, completeness, consistency, and timeliness.
- Security refers to the protection of data from unauthorized access, use, disclosure, disruption, modification, or destruction.
- Context refers to the meaning of data within a specific context.
Schema validation, data dictionaries, and data lineages are all tools that can be used to establish consistent data governance and ensure that data is in conformity with a predefined schema.
Here are some specific examples of how schema validation, data dictionaries, and data lineages can be used to ensure data conformity:
- Schema validation can be used to ensure that all data in a database conforms to the same set of rules. This helps to prevent data inconsistencies.
- Data dictionaries can be used to document the meaning of data elements and to ensure that data is used consistently across all systems.
- Data lineages can be used to track the flow of data through a system and to identify any potential data inconsistencies.
By using these tools, business intelligence professionals can help to ensure that data is consistent and that it meets the business needs of the organization.
What process involves confirming the validity of database keys, preserving table relationships, and ensuring that conventions are consistent?
Schema validation
AI assistant
The process that involves confirming the validity of database keys, preserving table relationships, and ensuring that conventions are consistent is called schema validation.
Schema validation is the process of ensuring that data conforms to a predefined schema. A schema is a set of rules that define the structure of data, including the names of data elements, the data types of data elements, and the relationships between data elements.
Schema validation is important because it helps to ensure the quality and integrity of data. When data conforms to a schema, it is more likely to be accurate, complete, and consistent. This makes it easier to use data for analysis and decision-making.
Here are some specific examples of schema validation checks:
- Checking that all foreign keys in a table reference valid primary keys in other tables.
- Checking that all data elements in a table have the correct data type.
- Checking that all data elements in a table follow the same naming conventions.
- Checking that all relationships between tables are consistent.
Schema validation can be performed manually or using automated tools. Automated schema validation tools are often used in enterprise organizations to ensure that data quality and integrity are maintained across large and complex databases.
The other answer choices are not as accurate:
- Sorting is the process of arranging data in a specific order.
- Data composition is the process of combining multiple data sources into a single data set.
- Typecasting is the process of converting data from one data type to another.
While these processes can be used to improve the quality of data, they do not specifically address the need to confirm the validity of database keys, preserve table relationships, and ensure that conventions are consistent.
Example
Here is an example of how schema validation can be used to ensure the integrity of a database:
A database has two tables: customers and orders. The customers table has a primary key column called customer_id. The orders table has a foreign key column called customer_id that references the customer_id column in the customers table.
Schema validation can be used to check that all of the foreign key values in the orders table reference valid primary key values in the customers table. This helps to ensure that the data in the database is accurate and consistent.
If a schema validation check reveals that there are any invalid foreign key values in the orders table, the database administrator can take corrective action to fix the problem. This could involve updating the foreign key value in the orders table or deleting the order record altogether.
By performing schema validation regularly, database administrators can help to ensure the quality and integrity of their data.
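One way to run that particular check is with a query that looks for orphaned foreign keys, as in the following sketch. SQLite is used here only to keep the example self-contained and runnable; the table and column names match the example above.

```python
# Sketch: find orders whose customer_id does not reference a valid customer.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Li');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);  -- 99 is invalid
""")

orphans = db.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

print("Orders with invalid customer_id:", orphans)  # [(12, 99)]
```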
Fill in the blank: A data _ describes the process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time.
lineage
AI assistant
The answer is lineage.
A data lineage describes the process of identifying the origin of data, where it has moved throughout the system, and how it has transformed over time. Data lineage is important for ensuring data quality and compliance, as it allows organizations to track the flow of data through their systems and to identify any potential errors or inconsistencies.
A data lineage map is a visual representation of the flow of data through a system. It shows the different stages of the data pipeline, from the source systems where the data is generated to the destination systems where the data is used for analysis or reporting.
Data lineage models are mathematical representations of the flow of data through a system. They can be used to simulate the flow of data and to identify potential bottlenecks or areas where data quality may be at risk.
Data dictionaries are documents that describe the data in a database or data warehouse. They include information such as the names of data elements, the data types of data elements, and the relationships between data elements.
Example
Here is an example of a simple data lineage:
- Source system: CRM system
- Data elements: Customer name, customer email address, customer phone number
- Transformation: Data is cleansed and normalized
- Destination system: Data warehouse
In this example, the data lineage shows that the customer data is extracted from the CRM system, cleansed and normalized, and then loaded into the data warehouse. The data lineage can also be used to track any changes that are made to the data along the way. For example, if the customer’s email address is changed in the CRM system, the data lineage can be used to track that change and to update the customer’s email address in the data warehouse.
Data lineage is an important tool for ensuring data quality and compliance. By tracking the flow of data through their systems, organizations can identify and address any potential problems before they cause any damage.
A business intelligence professional establishes what data will be collected, stored, and provided in a database. They also confirm how relationships are defined and the security of the data. What process does this scenario describe?
Creating business rules
AI assistant
The process this scenario describes is creating business rules.
A business rule is a statement that creates a restriction on specific parts of a database, and these rules are developed according to the way an organization uses its data. As this module explained, business rules affect what data is collected and stored, how relationships are defined, what kind of information the database provides, and the security of the data. That is exactly what the business intelligence professional in this scenario is establishing and confirming.
Because business rules shape so much of how a database is designed and how it functions, BI professionals also verify them over time, comparing incoming data against the rules to confirm that the database is still performing as intended and still meeting the organization's needs.
The other answer choices are not as accurate:
- Optimization: Optimization focuses on improving the speed and efficiency of an existing system, not on defining what data is collected, how relationships are defined, or how the data is secured.
- Iteration: Iteration refers to refining a process or deliverable over repeated cycles; it describes how work is done rather than the specific process in this scenario.
- Database modeling: Modeling produces a logical representation of a database's structure. It is closely related, because business rules heavily influence the model, but the scenario centers on establishing the restrictions that govern the data itself, which is the work of creating business rules.
By creating clear business rules, BI professionals help ensure that databases perform as intended and provide accurate, relevant information to stakeholders.