
Week 4: Data structures in Python

Now, you’ll explore fundamental data structures such as lists, tuples, dictionaries, sets, and arrays. Then, you’ll learn about two of the most widely used and important Python tools for advanced data analysis: NumPy and pandas.

Learning Objectives

  • Explain how to manipulate dataframes using techniques such as selecting and indexing, boolean masking, grouping and aggregating, and merging and joining
  • Describe the main features and methods of core pandas data structures such as dataframes and series
  • Describe the main features and methods of core NumPy data structures such as arrays
  • Define Python tools such as libraries, packages, modules, and global variables
  • Describe the main features and methods of built-in Python data structures such as lists, tuples, dictionaries, and sets

Lists and tuples


Welcome to module 4

This video highlights the learner’s progress in their Python learning journey, emphasizing their ability to use variables, data types, functions, operators, conditional statements, and loops. It introduces the concept of data structures and their importance in data analysis, and discusses two crucial libraries for data professionals: NumPy and pandas. It concludes by inviting the learner to continue their learning journey in the next video.

Hello again! You’ve come so far on
your learning journey! Just think of all the new Python skills you’ve developed along the way. You’ve learned how to use variables to store and label your data, and how to work with different data types, such as integers, floats, and strings. You can call functions to
perform actions on your data, and use operators to compare values. You also know how to write clean code that can be easily understood and reused by other data professionals. You can write conditional statements to tell the computer how to make decisions based on your instructions. And recently, you learned how to use loops to automate repetitive tasks. Coming up, we’ll explore data structures, which are collections of data values or objects that contain
different data types. Data professionals use data structures to store, access, organize, and categorize their data with speed and efficiency. Knowing which data structures fit your specific task is a
key part of data work, and will help streamline your analysis. We’ll focus on data structures that are among the most useful for data professionals: lists, tuples, dictionaries,
sets, and arrays. Part of what makes Python such a powerful and versatile programming language is the libraries and packages that are available to it. After we review fundamental
data structures, we’ll discuss two of the
most important libraries and packages for data professionals. The first is Numerical Python, or NumPy, which is known for its
high-performance computational power. Data professionals use NumPy to rapidly process large
quantities of data. I often use NumPy in my job because it’s so useful for analyzing large and complex datasets. The second is Python
Data Analysis Library, or pandas, which is a key tool for advanced data analytics. Pandas makes analyzing data in the form of a table with rows and columns easier and more efficient, because it has tools specifically
designed for the job. When you’re ready, I’ll
meet you in the next video!

Lab: Annotated follow-along guide: Data structures in Python

Video: Introduction to lists

In this video, the differences between data types and data structures are discussed, with a focus on the list as a specific kind of data structure in Python. Data types are attributes describing data based on values, programming language, or operations it can perform. Data structures are collections of data values, and lists, in particular, help store and manipulate ordered collections of items.

Lists, like strings, are sequences that allow duplicate elements, indexing, and slicing. Lists, however, are mutable, meaning their elements can be modified, added, or removed, whereas strings are immutable. The tutorial explains how to access elements in a list using indices and perform slicing to create subsets of the list.

The concept of mutability and immutability is highlighted, emphasizing that lists are mutable, allowing changes to their internal state. Practical examples are provided, such as checking if a list contains a specific element using the “in” keyword to generate a Boolean statement.

The tutorial underscores the usefulness of lists for organizing and categorizing related data, performing operations on multiple values simultaneously, and simplifying code. The audience is encouraged to stay tuned for more insights into working with lists.

Introduction to Lists in Python

Lists are a fundamental data structure in Python, used to store collections of items. They are versatile and can hold various data types, including integers, floats, strings, and even other lists. Lists are mutable, meaning their contents can be modified after creation.

Creating Lists

Creating a list in Python is straightforward. Use square brackets ([]) and enclose the items separated by commas. For example:

Python

my_list = [1, 2, 3, 4, 5]
print(my_list)

This code creates a list named my_list containing the numbers 1 to 5. The print statement displays the list’s contents.

Accessing List Elements

Elements in a list are accessed using their index, which starts from 0. Positive indexes refer to elements from the beginning of the list, while negative indexes start from the end. For example:

Python

print(my_list[0])  # Output: 1
print(my_list[-1])  # Output: 5

Modifying Lists

Lists are mutable, allowing you to change their contents after creation. You can add, remove, or modify elements using various methods. For example:

Python

my_list.append(6)  # Adds 6 to the end of the list
my_list.remove(2)  # Removes the element with value 2
my_list[1] = 7  # Replaces the second element with value 7

List Operations

Python provides various operations for manipulating lists, such as:

  • len(my_list): Returns the length of the list
  • my_list.sort(): Sorts the list in ascending order
  • my_list.reverse(): Reverses the order of elements in the list
  • my_list + another_list: Concatenates two lists
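
For example, a quick demonstration of these operations (a minimal sketch; the variable names are illustrative):

Python

nums = [3, 1, 2]
print(len(nums))      # Output: 3
nums.sort()           # nums is now [1, 2, 3]
nums.reverse()        # nums is now [3, 2, 1]
print(nums + [4, 5])  # Output: [3, 2, 1, 4, 5]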

List Slicing

List slicing allows you to extract a sublist from a list. Slicing uses the colon (:), where the start index (optional), end index (optional), and step (optional) are specified. For example:

Python

sublist = my_list[1:4]  # Extracts elements from index 1 (inclusive) to 4 (exclusive)
print(sublist)  # Output: [2, 3, 4]

List Comprehensions

List comprehensions provide a concise way to create lists based on an expression. They use a for loop, and optionally conditional statements, within square brackets. For example:

Python

squares = [x * x for x in range(1, 6)]  # Creates a list of squares from 1 to 5
print(squares)  # Output: [1, 4, 9, 16, 25]

Conclusion

Lists are a powerful and versatile data structure in Python, essential for organizing and managing collections of data. Their flexibility and ease of use make them a valuable tool for various programming tasks.

Great to be with you again. In this video, we’ll
discuss the differences between data types and data structures. Then, we’ll explore lists, which are a specific
kind of data structure. As you’ve learned, a
data type is an attribute that describes a piece of
data based on its values, its programming language, or
the operations it can perform. In the context of Python, this includes classes such as integers, strings, floats, and Booleans,
among others. A data structure is a
collection of data values, or objects that can contain
different data types. So, data structures can
contain data type elements, such as a float or a string. Data structures also enable
more efficient storage, access, and modifications. They allow you to organize
and categorize your data, relate collections of data to each other, and perform operations accordingly. One data structure in Python is a list. A list is a data structure
that helps store, and if necessary, manipulate
an ordered collection of items, such as a list of email addresses associated with a user account. A list is a lot like a string, and you can do many of the
same things with lists. For example, both strings and lists allow duplicate elements, as well as indexing and slicing. Additionally, both are sequences. A sequence is a positionally
ordered collection of items. However, strings are
sequences of characters, while lists store sequences
of elements of any data type. There are some other key differences between lists and strings. First, note that different data structures are either mutable or immutable. Mutability refers to the ability to change the internal state of a data structure. Immutability is the reverse, where a data structure,
or element’s values can never be altered or updated. Lists and their contents are mutable, so their elements can be
modified, added, or removed. But strings are immutable. Think of a list like a long box, with a space inside, divided
into different slots. Each slot contains a value, and each value can store data. This could be another data structure, such as another list, or an integer, string, float, or output from another function. When working with lists, you use an index to access each of the elements. Recall that an index provides
the numbered position of each element in an ordered sequence. In this case, our sequence is a list. Let’s go through an
indexing example together. First, assign the following list of words to a variable X: “Now,” “we,” “are,” “cooking,” “with,” “seven,” “ingredients.” In Python, we use square
brackets to indicate where the list starts and ends. And commas to separate each
element contained in it. To print an element of a
list, use its index number. So, to print the word “cooking,” print the third element
of the list variable X. This is just like focusing
on a specific character or substring in a string. The index of the first element in a list,
as with strings, is zero. So, if we print the element
with slot number three, we get the item, or word “cooking,” from our list of seven words. Remember that indexing
always starts at zero. So, if we had typed seven, to try to access the
last word of our list, we’d get an IndexError. We can also use indices to
create a slice of the list. For this, use ranges of two
numbers, separated by a colon… to get the second and third
words of our list: “we” and “are.” You can also use a colon to get all words up to index slot two… We get “now” and “we”, which have index slots zero and one. You can also leave one of the range
indexes empty on either side of the colon. This will give us the
other part of the list. So, just as with string indexing, the first value for the
first element in the list defaults to zero. And the second value, if left empty, defaults to the length of the list. To check if a list of words
contains a certain element, like the word “this”,
use the keyword “in” to generate a Boolean statement. This verifies whether the word exists. The result of this check is a Boolean, which we can then use as a condition for branching or looping
in the rest of the code. Lists are very useful
when you’re working with many related values. They enable you to keep
the right data together, simplify your code, and
perform the same operations on multiple values at once. Coming up, we have even more on lists.
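
To follow along, here is a sketch of the indexing, slicing, and membership examples described in this video (the notebook’s exact code may differ):

Python

x = ["Now", "we", "are", "cooking", "with", "seven", "ingredients"]
print(x[3])         # Output: cooking
print(x[1:3])       # Output: ['we', 'are']
print(x[:2])        # Output: ['Now', 'we']
print(x[2:])        # Output: ['are', 'cooking', 'with', 'seven', 'ingredients']
print("this" in x)  # Output: False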

Fill in the blank: Mutability refers to the ability to _____ the internal state of a data structure.

change

Mutability refers to the ability to change the internal state of a data structure. Immutability is the reverse, where a data structure or element’s values can never be altered or updated.

Video: Modify the contents of a list

This video continues the discussion of lists in Python, focusing on modifying their contents. The append() method adds an element to the end of a list, while the insert() method inserts an element at a specific index. The remove() method removes an element from the list, and the pop() method removes and returns an element from a specific index.

The text also discusses the difference between mutable and immutable data types. Strings are immutable, meaning that their contents cannot be changed after creation. Lists, on the other hand, are mutable, meaning that their contents can be changed after creation.

The video concludes with a snack break and a promise to continue the discussion of lists in the next video.

Essential Operations for Modifying Lists in Python

Manipulating lists is a fundamental aspect of programming in Python. Lists are mutable data structures, meaning their contents can be altered after creation. This flexibility makes them invaluable for storing and managing collections of data.

1. Adding Elements to Lists: The append() Method

The append() method is the most straightforward way to add elements to a list. It inserts the specified element to the end of the list. The syntax is simple:

Python

list_name.append(element_to_add)

For instance, consider a list of fruits:

Python

fruits = ["apple", "banana", "orange"]

To add “mango” to the list, use the append() method:

Python

fruits.append("mango")

The updated list will be:

Python

fruits = ["apple", "banana", "orange", "mango"]

2. Inserting Elements into Lists: The insert() Method

The insert() method provides more precise control over element placement. It inserts the specified element at a particular index within the list. The syntax is:

Python

list_name.insert(index, element_to_insert)

Here’s an example of inserting “kiwi” at index 1:

Python

fruits.insert(1, "kiwi")

The updated list will be:

Python

fruits = ["apple", "kiwi", "banana", "orange", "mango"]

3. Removing Elements from Lists: The remove() and pop() Methods

The remove() method eliminates the first occurrence of a specified element from the list. Its syntax is:

Python

list_name.remove(element_to_remove)

For example, to remove “banana” from the list:

Python

fruits.remove("banana")

The updated list will be:

Python

fruits = ["apple", "kiwi", "orange", "mango"]

The pop() method, on the other hand, removes and returns the element at a specific index. Its syntax is:

Python

removed_element = list_name.pop(index)

For instance, to remove and store the element at index 2:

Python

removed_fruit = fruits.pop(2)

This will remove “orange” from the list and store it in the variable removed_fruit. The updated list will be:

Python

fruits = ["apple", "kiwi", "mango"]

4. Replacing Elements in Lists:

To replace an element at a specific index, use assignment:

Python

list_name[index] = new_element

For example, to replace “kiwi” with “pineapple”:

Python

fruits[1] = "pineapple"

The updated list will be:

Python

fruits = ["apple", "pineapple", "mango"]

5. Clearing Lists: The clear() Method

The clear() method removes all elements from a list, effectively emptying it. Its syntax is:

Python

list_name.clear()

Applying this to the fruits list will result in:

Python

fruits = []

Conclusion

Modifying lists in Python is essential for data manipulation and organization. The append(), insert(), remove(), pop(), and clear() methods provide powerful tools for adding, removing, and replacing elements, making lists versatile data structures.

In this video, we’ll
continue with lists. You’ll learn how to modify
the contents of a list. This will give you greater
control over your lists because you can add, remove, and change the items that they contain. Previously, we thought about
a list as a box divided into different slots. Modifying it means we keep
the box, but we add, remove, or change what’s inside. When thinking about modifying lists, there are a few methods that can be used. We’ll begin with the append method. The append method adds an
element to the end of a list. This requires one argument because this function
adds the incoming element to the end of the list
as a single new entry. You can even start with an empty list and all new elements
will be added at the end. Let’s explore an example. We’ll begin by typing a list of fruits: Upon further inspection,
it seems we forgot to add kiwi to the list. So, we can use the
append method to add it. This uses one parameter; in
this case, the string kiwi. Another common method for
modifying lists is insert, which requires two arguments: the index number of the
element to be modified, and the contents being put in that slot, such as a string or integer. Let’s investigate how insert works. Insert is a function that takes an index as the first parameter and an element as the second parameter,
then inserts the element into a list at the given index. Returning to our list of fruits… Now, orange is inserted at
the second spot at index one in our fruit list. Let’s add an element at
the beginning of the list. In the first parameter, we’ll put zero. Then, type mango as the second element. To remove an element from a list, let’s consider the remove method. Remove is a method that
removes an element from a list. Similar to the append method, remove only requires one parameter. Now our fruit list no longer has a banana. If we try to remove an element
that is not in the list, like strawberries, for
example, we get a ValueError. Another common way to remove elements is with the pop method, which uses an index. Pop extracts an element
from a list by removing it at a given index. So, to remove orange,
pop the third element in the list with index number two. Now, suppose that, after removing orange, you decide to also remove pineapple and replace it with mango. Simply reassign its value. Reference the pineapple
item’s index number, one, and replace it with mango. This renders the list without orange, because we already removed it, as well as mango, which
replaced pineapple. Our fruits have changed
a lot since we started. But it’s always the
same list, the same box. We’ve just modified what’s inside. At this point, I want to address something that new learners of
Python often wonder about. You’ll recall that strings are immutable and lists are mutable. What this means exactly
might not be clear at first. After all, didn’t we have
multiple videos about how to manipulate strings? A new example will help
make this more clear. Whenever we modify a
string, we always have to reassign the change
back to the variable that contained the string. This is overwriting the existing variable with a brand new one. Notice that we can’t, say, overwrite the character at index 0. We get an error. However, we can do this with a list. That’s why lists are considered mutable. Great work modifying lists. Now, after all that
talk about tasty fruit, I think we deserve a snack break! I’ll catch up with you again very soon.
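
As a sketch of the mutability contrast described in this video (the notebook’s exact example may differ):

Python

word = "kiwi"
# word[0] = "K"     # TypeError: strings are immutable
word = "Kiwi"       # to change a string, reassign the whole variable

fruits = ["kiwi", "mango"]
fruits[0] = "Kiwi"  # works, because lists are mutable
print(fruits)       # Output: ['Kiwi', 'mango']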

Reading: Reference guide: Lists

Video: Introduction to tuples

Tuples are immutable sequences of data that are useful for storing information that needs to be processed together in the same structure. They are more secure than lists because they cannot be changed as easily. Tuples can be instantiated using parentheses or the tuple() function. They can also be used to return values from functions. Tuples are iterable, so we can extract information from them using loops. Using tuples in data professional work can help make your processes more efficient and save your team time and effort.

What are Tuples?

Tuples are a fundamental data structure in Python. They are immutable sequences of objects, meaning that their contents cannot be changed after creation. This makes them a secure way to store data that needs to remain constant. Tuples are often used to store collections of related data, such as the name, age, and city of a person.

Creating Tuples

Tuples are created using parentheses. For example, the following code creates a tuple containing the names of three fruits:

fruits = ("apple", "banana", "orange")

You can also use the tuple() function to create a tuple from a list or other iterable:

numbers = [1, 2, 3, 4, 5]
numbers_tuple = tuple(numbers)

Accessing Elements of Tuples

Elements of a tuple are accessed using their index, which is a non-negative integer. The first element has index 0, the second element has index 1, and so on. For example, the following code accesses the second element of the fruits tuple:

second_fruit = fruits[1]
print(second_fruit)  # Output: banana

Immutability of Tuples

One of the key features of tuples is that they are immutable. This means that once a tuple is created, its contents cannot be changed. For example, the following code will raise an error:

fruits[0] = "grape"  # This will raise an error

Operations on Tuples

There are a few basic operations that can be performed on tuples:

  • Indexing: Accessing elements of a tuple using their index.
  • Slicing: Extracting a sub-sequence from a tuple.
  • Concatenation: Combining two tuples into a single tuple.
  • Membership: Checking whether an element is present in a tuple.
  • Length: Determining the number of elements in a tuple.
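
For example, each of these operations applied to the fruits tuple from above:

print(fruits[0])           # Indexing: apple
print(fruits[1:3])         # Slicing: ('banana', 'orange')
print(fruits + ("kiwi",))  # Concatenation: ('apple', 'banana', 'orange', 'kiwi')
print("banana" in fruits)  # Membership: True
print(len(fruits))         # Length: 3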

Comparison of Tuples

Tuples can be compared using the usual comparison operators (==, !=, <, >, <=, >=). Two tuples are considered equal if they have the same length and the corresponding elements are equal.
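
For example:

print((1, 2, 3) == (1, 2, 3))  # Output: True
print((1, 2, 3) < (1, 2, 4))   # Output: True (compared element by element)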

Applications of Tuples

Tuples are often used in the following situations:

  • Storing constant data: When you need to store data that should not be changed, such as the names of months or the days of the week.
  • Returning multiple values from functions: Tuples are a convenient way to return multiple values from functions, as shown in the sketch after this list.
  • Storing data with positional significance: When the order of the data is important, such as the coordinates of a point in space.
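
As an illustration of returning multiple values, here is a minimal sketch of the dollars-and-cents function described later in the video transcript (the function name and the rounding approach are assumptions):

def to_dollars_cents(price):
    dollars = int(price)
    cents = round((price - dollars) * 100)  # round guards against floating-point error
    return dollars, cents  # returning two values actually returns a tuple

result = to_dollars_cents(6.55)
print(result)            # Output: (6, 55)
dollars, cents = result  # unpacking the tuple into separate variables
print(dollars, cents)    # Output: 6 55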

Conclusion

Tuples are a versatile and powerful data structure in Python. Their immutability makes them a secure way to store data, and their ability to be used in various operations makes them a valuable tool for data manipulation. By understanding the basics of tuples, you can enhance your Python programming skills and tackle a wide range of programming tasks.

As a data professional, sometimes it will be
more important to access and reference data than to
change and manipulate it. When you simply need to
find some information, but keep the data intact, you can use a data
structure called a tuple. A tuple is an immutable sequence that can contain elements
of any data type. Tuples are kind of like lists, but they’re more secure because they cannot be changed as easily. They’re helpful because they keep data that needs to be processed
together in the same structure. Tuples are instantiated, or
expressed, with parentheses or the tuple() function. Here we have a tuple that
represents someone’s full name. Notice that it was
instantiated using parentheses. The first element of the
tuple is their first name, the second is the first
letter of their middle name, and the third is their last name. The position of the
element is fixed in tuples, so you can’t add new
elements in the middle, and you can’t change any of the elements. If we try to change the last name, which lives in index number
2, from Hopper to Copper, the code will throw an error. You can add a value to the end, but only if you reassign the tuple. Another way we can create tuples is by using the tuple function to transform input into tuples. In this case, our name is
represented as a list. We can convert the list to a tuple by using the tuple function. Notice that the name is no longer a list, so it doesn’t have the brackets anymore. Tuples are also used to
return values from functions. In fact, when a function
returns more than one value, it’s actually returning a tuple. For example, here’s a function
that takes as an argument a float value that represents a price. The function returns the
number of dollars and cents. Notice that when we use
the function to convert $6 and 55 cents to dollars and cents, the return value is represented as a tuple that contains two numbers. Interestingly, even though
tuples are immutable, they can be split into separate variables. When we run the “to dollar cents” function, we can directly assign the
output into distinct variables. The information stored as a
tuple in the result variable has now been reassigned
to two separate variables that we can manipulate as we please. This process is known
as unpacking a tuple. Notice that the unpacked
variables themselves are no longer tuples. In this case, they’re integers. A big advantage of working with tuples is that they let you store
data of different types inside other data structures. Here’s an example of how
this might be useful. This is a list of the
starting five players on a university women’s basketball team. Each player is represented by a tuple that contains their
name, age, and position. This is a useful way of working with this type of information. The order of the players
doesn’t matter that much, and we might want to add
to or rearrange them. So we use a list, which is mutable. However, the players themselves
are individual records that are represented by tuples. They are a bit more secure
because tuples are immutable and more difficult to accidentally change. Because lists and tuples are iterable, we can extract information
from them using loops. For example, we can write a “for” loop that unpacks each tuple into
three separate variables, and then print one of the
variables for each iteration. This is equivalent to looping
over each player record and printing the record at index zero. Using tuples in data professional work helps make your processes more efficient. It saves memory and can really
optimize your programs too. Plus, when others collaborate
with you on your code, your use of tuples will
make it clear to them that your sequences of values are not intended to be modified. This is yet another great
way to save your team time and effort.

A tuple is an immutable sequence that can contain elements of any data type.

True

A tuple is an immutable sequence that can contain elements of any data type.

Reading: Compare lists, strings, and tuples


Video: More with loops, lists, and tuples

This video discusses more complex examples of how to work with loops, lists, and tuples in Python. It also introduces a few new tools that are useful for data professionals.

One example is a function that extracts the name and position of each player in a list of tuples. The function uses a for loop to unpack the tuples and then formats the data into a string.
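
A minimal sketch of such a function, assuming each player is stored as a (name, age, position) tuple (the player data here is made up; the notebook’s data differs):

Python

players = [("Maria", 20, "center"), ("Alexis", 22, "guard")]

def player_position(players):
    result = []
    for name, age, position in players:  # unpack each tuple
        result.append(f"{name}: {position}\n")
    return result

for line in player_position(players):
    print(line)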

Another example is a nested loop that creates all the different domino tiles in a set of dominoes. The inner loop generates the pips on the right side of the domino, and the outer loop generates the pips on the left side of the domino.
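
A sketch of those nested loops, assuming the right side starts at the left value so that each tile appears exactly once:

Python

for left in range(0, 7):          # pips on the left side: 0 through 6
    for right in range(left, 7):  # pips on the right side
        print(f"[{left}|{right}]", end=" ")  # end=" " keeps tiles on one line
    print()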

Finally, the video discusses list comprehensions, which are a concise way to create new lists from existing lists. A list comprehension is basically a for loop written in reverse, with the computation at the beginning and the “for” statement at the end.
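
For example, totaling the pips on each domino with a list comprehension, assuming dominoes is a list of (left, right) tuples:

Python

dominoes = [(left, right) for left in range(0, 7) for right in range(left, 7)]
pips_from_list_comp = [left + right for left, right in dominoes]
print(pips_from_list_comp[:5])  # Output: [0, 1, 2, 3, 4]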

The video encourages viewers to explore the notebook on their own and play with the code to learn more.

Hello again! In this video, we’ll consider more complex examples of different ways to work with loops, lists, and tuples. I’ll also introduce a few new tools that are useful for data professionals. Let’s return to the
women’s basketball team list of players from the previous video. In this example, we’ll integrate string formatting, loops, tuples, and lists. Remember, we have a list of tuples, where each tuple contains the name, age, and position of a player. Let’s define a function that extracts the name and position of each player into a list that we’ll use to print the information. We’ll call the function “player position,” and its argument will be a list of tuples that contain player information. Next, we’ll instantiate an empty list that will hold our result, which we’ll build as we
loop through the data. Now we’ll use a for loop to unpack the tuples in our list of players. The variables we assign in the for loop must align with the format of the tuples. There are three components
of each tuple in our list: name, age, and position, so we need three
variables in our for loop. If we tried to unpack the tuples using only two variables, like, “for name, age in players,” the computer would throw an error. It wouldn’t know what to do, because there are three
elements of the tuple but we’re only giving the computer two containers to put them in. So we’ll begin our for loop “for name, age, position in players.” Then we’ll use string formatting to append each name and position to the result list. Each string will include some positional formatting
and a new line too. Finally, we’ll call this
function in a for loop. This for loop will iterate over the results list that is output by the function and print each one. Now we have a nicely formatted, easily readable table of
players and positions. Here’s another application
of loops and lists. This is an example of nested loops. A nested loop is when you have one loop inside of another loop. These loops create all
the different domino tiles in a set of dominoes, which is a tile-based game played with numbered gaming pieces. Feel free to pause the video and try to figure out what’s happening… We start with the numbers on the left side of the dominoes. These numbers represent the dots, or pips, on the domino. They range from zero to six. For each number in this range, we’ll run another loop to generate the pips on the right side of the domino. Then we insert the left number and right number into a
formatted print statement. And here are the dominoes! Notice that in the first print statement we included a parameter called “end” whose value was a whitespace. When a print statement executes, by default it will end with a new line. So without this parameter, all the dominoes would have been printed in a vertical line, each one beneath the next. But when we set the end
character to a whitespace, it prints a space between
each domino instead. Here’s the same code, but instead of printing
the dominoes as strings, it stores each one as a tuple of integers in a list called “dominoes.” Now suppose we want to check the second number of
the tuple at index four. We can do that by using indexing. Start with the list we
want to access, “dominoes,” and put the index of the tuple we want to access in brackets. Then add another pair
of brackets containing the index of the value within that tuple. What if we want to calculate the total number of pips on each domino? We can do that with a for loop that iterates over each tuple, sums the value at index zero and the value at index one, and appends the sum to a list. But there’s a much
easier way of doing this. It’s called a list comprehension. A list comprehension formulaically creates a new list based on the values in an existing list. Here’s how it works. Begin by assigning a
variable for our new list. We’ll call this one “pips from list comp.” For its value, create an empty list. Now we basically write a for loop, only in reverse. We begin with the calculation that creates each element of the list. In this case, we want each element to be
the total number of pips on the domino, which is the value at index zero plus the
value at index one. Then we add a “for” statement. We can check to make sure it gave the same result as our for loop. They’re the same! Note what happened. This is why I said a list comprehension is like a for loop written in reverse. The “for” part of it is at
the end of the statement and the computation is at the beginning. Both the for loop and
the list comprehension do the same thing, but
the list comprehension is much more elegant, and usually faster to execute too. Hopefully by now you can appreciate how powerful the building
blocks of Python can be. I encourage you to explore this notebook on your own and play with the code to discover what happens when you add something here or change something there. Playing with code is one
of the best ways to learn! I’ll see you soon.

Reading: zip(), enumerate(), and list comprehension


Lab: Exemplar: Lists & tuples

Practice Quiz: Test your knowledge: Lists and tuples

Lists and their contents are immutable, so their elements cannot be modified, added, or removed.

What Python method adds an element to the end of a list?

A data professional wants to instantiate a tuple. What Python elements can they use to do so? Select all that apply.

What Python technique formulaically creates a new list based on the values in an existing list?

Dictionaries and sets


Video: Introduction to dictionaries

Dictionaries are a fundamental data structure in Python that store data in key-value pairs. They are versatile and widely used by data professionals to analyze large datasets and store user information. Dictionaries can be created using braces or the dict() function. Keys must be immutable, such as integers, floats, tuples, or strings. Dictionaries are unordered, meaning you cannot access them by index. Use the “in” keyword to check if a key exists. Dictionaries are powerful and versatile, and we will explore more examples and tools in the next lesson.

Introduction to Dictionaries in Python

Dictionaries, also known as associative arrays, are a versatile data structure in Python that stores data in key-value pairs. Each key is unique and serves as an identifier for its corresponding value. Dictionaries are mutable, meaning their contents can be changed after they are created.

Creating a Dictionary

Dictionaries can be created using curly braces ({}) or the dict() function.

Using Curly Braces

Python

# Creating a dictionary using curly braces
animal_sounds = {'dog': 'bark', 'cat': 'meow', 'bird': 'chirp'}
print(animal_sounds)

Output:

{'dog': 'bark', 'cat': 'meow', 'bird': 'chirp'}

Using the dict() Function

Python

# Creating a dictionary using the dict() function
employee_data = dict()
employee_data['name'] = 'Alice'
employee_data['age'] = 30
employee_data['role'] = 'Software Engineer'
print(employee_data)

Output:

{'name': 'Alice', 'age': 30, 'role': 'Software Engineer'}

Accessing Values

To access a value in a dictionary, use the key enclosed in square brackets ([]).

Python

# Accessing values using keys
animal_sound = animal_sounds['dog']
print(animal_sound)

Output:

bark

Adding Key-Value Pairs

To add a new key-value pair to an existing dictionary, use the assignment operator (=).

Python

# Adding a new key-value pair
animal_sounds['cow'] = 'moo'
print(animal_sounds)

Output:

{'dog': 'bark', 'cat': 'meow', 'bird': 'chirp', 'cow': 'moo'}

Modifying Values

To modify the value associated with an existing key, use the assignment operator (=).

Python

# Modifying an existing value
employee_data['age'] = 32
print(employee_data)

Output:

{'name': 'Alice', 'age': 32, 'role': 'Software Engineer'}

Checking for Keys

To check if a key exists in a dictionary, use the in operator.

Python

# Checking for a key
if 'name' in employee_data:
    print('Key "name" exists in the dictionary')

Output:

Key "name" exists in the dictionary

Removing Key-Value Pairs

To remove a key-value pair from a dictionary, use the del keyword.

Python

# Removing a key-value pair
del animal_sounds['cow']
print(animal_sounds)

Output:

{'dog': 'bark', 'cat': 'meow', 'bird': 'chirp'}

Iterating over Dictionaries

To iterate over the keys in a dictionary, use a for loop.

Python

# Iterating over keys
for key in animal_sounds:
    print(key)

Output:

dog
cat
bird

To iterate over both keys and values, use a for loop with the .items() method.

Python

# Iterating over key-value pairs
for key, value in animal_sounds.items():
    print(f"Key: {key}, Value: {value}")

Output:

Key: dog, Value: bark
Key: cat, Value: meow
Key: bird, Value: chirp

Dictionaries are a powerful and versatile data structure that is essential for data manipulation in Python. They are widely used in various applications, including web development, data analysis, and machine learning.

Dictionaries are one
of the most widely used and important data structures in Python. A dictionary is a data structure that consists of a collection
of key-value pairs. They are instantiated with
braces or the dict() function. We’ll discuss that more shortly. Both veteran and entry
level data professionals use dictionaries to
analyze large data sets with fast processing power. This helps them gather and
transform user information. Dictionaries also provide a straightforward way to store data, making it easier for users
to find specific information. To use a regular dictionary,
not the data structure, but the actual book with
words and definitions, you look at the word, find
it, then read its definition. It’s essentially the same
with the Python dictionary. You look at the key, which
will let you access the values associated with that key. That’s what’s meant by “key-value pairs.” Here’s a simple example
to illustrate the concept. Suppose we have a zoo, and
the zoo has different pens that contain different animals. We could have a dictionary that stores this information for us, with the pen numbers as keys
and the animals as values. We could use this dictionary to look up which animals are in each pen. For example, if we want to know
what animals are in pen two, we type the name of the dictionary, zoo, followed by the pen in brackets. Accessing a dictionary this way always searches over the
keys and returns the values of the corresponding key. It doesn’t work the other way around. I can’t use indexing to search for zebras and find out their pen. I will get a key error. Because “zebras” is not a
key in the dictionary. Dictionaries are instantiated
mainly in two ways. The first way is with braces. With this approach, each key is separated from its value by a colon,
and each key-value pair is separated from the next by a comma. The second way to create a dictionary is with the dict function. When using the dict function, the syntax is a little different. When the keys are strings, you can type them as keyword arguments. The last time we made this dictionary, we used quotation marks to indicate that the keys were strings. Here we don’t because they’re keyword arguments. Also, instead of using
a colon between the keys and their values, you use an equal sign. The dictionary lookup is the same, irrespective of whether braces
or the dict() function is used. The dict function is also
a little more flexible in how it can be used. For example, we can create
this same dictionary once again by passing a list of lists as an argument, or even a list of tuples, a tuple of tuples, or a tuple of lists. They all give us the same result. If we want to add a new key-value pair to an existing dictionary, say to put crocodiles in pen four, we can assign it like this. A dictionary’s keys must be immutable. Immutable keys include,
but are not limited to, integers, floats, tuples and strings. Lists, sets, and other dictionaries are not included in this category, since they are mutable. Another important thing
to note about dictionaries is that they’re unordered. That means you can’t access them by referencing a positional index. If I try to access our zoo
dictionary at index two, I get a key error, because the computer is interpreting the two as a dictionary
key, not as an index. Also, because dictionaries are unordered, you’ll sometimes find that
the order of the entries can change as you’re working with them. If the order of your data is important, it’s better to use an ordered
data structure like a list. Finally, you can check if a key
exists in your dictionary simply by using the “in” keyword. Note that this only works for keys. You can’t check for values this way. There’s a lot that we
can do with dictionaries and what we’ve reviewed
here is only the beginning. Next, we’ll consider more examples that show the power of dictionaries. You’ll also learn about some tools that make working with dictionaries
easy and convenient. Meet you there.
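
One way to reproduce the zoo example from this video (the specific animals and pen numbers are illustrative):

Python

# Instantiating with braces
zoo = {1: "lions", 2: "zebras", 3: "monkeys"}
print(zoo[2])  # Output: zebras

# The dict() function also accepts a list of lists (or tuples) of key-value pairs
zoo = dict([[1, "lions"], [2, "zebras"], [3, "monkeys"]])

zoo[4] = "crocodiles"  # adding a new key-value pair
print(2 in zoo)        # Output: True (checks keys, not values)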

Fill in the blank: A dictionary is a data structure that consists of a collection of _____ pairs.

key-value

A dictionary is a data structure that consists of a collection of key-value pairs. In a Python dictionary, looking up a key lets you access the data values associated with that key.

Video: Dictionary methods

Dictionaries are a cornerstone of Python’s data structures, offering a flexible and efficient approach to data organization and retrieval. Unlike lists, which rely on positional indices, dictionaries capitalize on key-value pairs, enabling rapid data lookup and manipulation.

Transforming a list of tuples into a dictionary involves iterating through the tuples using a for loop. Within each iteration, extract the key and value from the tuple and assign the key to the dictionary. For each key, maintain a list to store the corresponding values.
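
A minimal sketch of this pattern, assuming a list of (name, age, position) tuples (the player data here is made up):

Python

team = [("Maria", 20, "center"), ("Alexis", 22, "guard"), ("Drew", 21, "guard")]

new_team = {}
for name, age, position in team:
    if position in new_team:                    # key already exists:
        new_team[position].append((name, age))  # append to its list of players
    else:
        new_team[position] = [(name, age)]      # otherwise, create the key
print(new_team)
# {'center': [('Maria', 20)], 'guard': [('Alexis', 22), ('Drew', 21)]}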

Python provides several powerful methods for interacting with dictionaries. The keys() method retrieves a list of all keys in the dictionary. Similarly, the values() method returns a list of all values. To access both keys and values simultaneously, utilize the items() method, which returns a list of tuples, each containing a key and its respective value.

Dictionaries stand as an indispensable tool for data analytics endeavors. Their ability to store, organize, and retrieve data efficiently makes them an integral component of the Python data science toolkit. As you delve deeper into the world of dictionaries, you’ll uncover their vast potential for data manipulation and analysis.

Exploring Dictionary Methods in Python

Dictionaries are fundamental data structures in Python that excel at storing and organizing data using key-value pairs. Unlike lists, which rely on positional indices, dictionaries offer a more flexible and intuitive approach to data management. They are widely used in various applications, including data analysis, web development, and machine learning.

Python provides a rich set of built-in methods for manipulating and interacting with dictionaries, making them even more powerful and versatile. These methods facilitate efficient data retrieval, modification, and addition, streamlining the process of data manipulation.

Essential Dictionary Methods for Data Management

  1. keys() method: The keys() method retrieves a view of all keys present in the dictionary, providing a concise overview of the dictionary’s contents.

Python

dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
keys = dictionary.keys()
print(keys)

Output:

dict_keys(['name', 'age', 'city'])
  2. values() method: This method returns a view of all values associated with the dictionary’s keys, offering a clear view of the stored data.

Python

dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
values = dictionary.values()
print(values)

Output:

dict_values(['Alice', 30, 'Seattle'])
  3. items() method: The items() method retrieves a view of (key, value) tuples, where each tuple contains a key and its corresponding value. This provides a combined view of the dictionary’s structure and data.

Python

dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
items = dictionary.items()
print(items)

Output:

dict_items([('name', 'Alice'), ('age', 30), ('city', 'Seattle')])
  4. get() method: This method takes a key as an argument and returns the value associated with that key. If the key does not exist, it returns None by default.

Python

dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
value = dictionary.get("name")
print(value)

value = dictionary.get("occupation")
print(value)

Output:

Alice
None
  5. setdefault() method: The setdefault() method takes a key and an optional default value as arguments. If the key exists, it returns the existing value. If the key does not exist, it adds the key to the dictionary and returns the default value.

Python

dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
value = dictionary.setdefault("name", "default_name")
print(value)

value = dictionary.setdefault("occupation", "default_occupation")
print(value)

print(dictionary)

Output:

Alice
default_occupation
{'name': 'Alice', 'age': 30, 'city': 'Seattle', 'occupation': 'default_occupation'}
  6. update() method: This method takes another dictionary as an argument and updates the current dictionary with the key-value pairs from the other dictionary.

Python

dictionary1 = {"name": "Bob", "age": 25, "city": "San Francisco"}
dictionary2 = {"occupation": "Software Engineer", "company": "Google"}
dictionary1.update(dictionary2)
print(dictionary1)

Output:

{'name': 'Bob', 'age': 25, 'city': 'San Francisco', 'occupation': 'Software Engineer', 'company': 'Google'}

  7. copy() method: The copy() method creates a shallow copy of the dictionary. A shallow copy means that the new dictionary will have the same keys as the original dictionary, but the values will be references to the same objects as the original dictionary.

Python

dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
new_dictionary = dictionary.copy()

new_dictionary["name"] = "Charlie"
print(dictionary)
print(new_dictionary)

Output:

{'name': 'Alice', 'age': 30, 'city': 'Seattle'}
{'name': 'Charlie', 'age': 30, 'city': 'Seattle'}

These essential dictionary methods provide a solid foundation for mastering data manipulation in Python. By understanding and utilizing these methods, you can store, retrieve, and update data efficiently in your own programs.

Previously, you were
introduced to dictionaries and learned a little
bit about how they work. Let’s continue our
exploration of dictionaries and how to use them. Let’s consider a previous example and revisit the women’s
basketball team roster. Recall that the roster was
encoded as a list of tuples. Each tuple represented the name, age, and position of a player on the team. The list of tuples was useful
when we had a single team and one player per position. What if we add more players
beyond the starting five? A dictionary can help us organize the data according to our specific needs. For example, what if we want
to be able to look up players by their position? We can create a dictionary
where the positions are the keys and the players are the values, each represented as a tuple
that contains two values: their name and age. We could retype the
information into a dictionary or cut and paste it, but if you find yourself
doing these things, you can take this as an opportunity to improve your coding skills. Consider that this is the
information for just 10 players; what if we had data for the whole league? We can convert this
information to a dictionary with a for loop and
some conditional logic. We’ll begin by instantiating
an empty dictionary and assigning it to a
variable named “new_team.” The idea is to loop over
each tuple in the list, extract the position and
assign it as a dictionary key, and extract the player’s name and age and assign them as a tuple within a list, representing the value for that key. The process would repeat for
each iteration of the loop, until all of the players are
recorded in the dictionary. Notice that each position is
only represented once as a key. So the final dictionary has five keys, and there are two players in
the list at each key’s value. Now let’s write the loop. First, we’ll assign the empty dictionary to a variable called “new_team.” Then we’ll write a for loop
that unpacks the information in the original tuples. “For
name, age, position in team.” And here’s where the
conditional logic comes in. If the position already exists as a keyword in our dictionary, then we want to append
the name and age tuple to the list of tuples. Remember, the value for
each key will be a list that contains tuples
of player information. If the position does not already exist as a keyword in the dictionary,
we’ll have to assign it. We’ll use an else statement to do this. With only a few lines of code, we have converted our list
of tuples to a dictionary. Let’s check that it works. It sure does. Creating dictionaries this way is a common practice in
data analytics with Python, so learning this process
will help make you a more capable data professional. Now, let’s learn about some useful methods that we can use on dictionaries to really take advantage of their power. Specifically, we’ll
discuss the keys, values, and items methods. If you run a loop over a dictionary, the loop will only access
the keys, not the values. For example, if we loop over
the dictionary we just created and print each iteration, the computer will return five positions, the keys of the dictionary. But you don’t need to write a loop every time you want to access
the keys of your dictionary. That’s what the keys method is for. The keys method lets you retrieve only the dictionary’s keys. Returning to our “new_team” dictionary, when we apply the keys
method to the dictionary, the computer returns a
list of all its keys. Similarly, the values
method lets you retrieve only the dictionary’s values. When applied to our “new_team” dictionary, we get the values returned as a list. Since our values are lists of tuples, it means the result of calling this method is a list of lists of tuples. But what if you want to access both the keys and their values? You can, using the items
method, which lets you retrieve both the dictionary’s keys and values. Let’s use a loop to print
what the items method returns so the output is prettier. Dictionaries make storing and retrieving data fast and efficient. Keep exploring the many
things you can do with them. With time, you’ll find that
they become an important tool in your data analytics toolbox.

Reading: Reference guide: Dictionaries


Video: Introduction to sets

Sets in Python

Sets are data structures in Python that store unordered, non-interchangeable elements. Each element must be unique and immutable. Sets are mutable, meaning they can be changed after creation.

Creating Sets

Sets can be created using the set() function or non-empty braces. The set() function takes an iterable as an argument and returns a new set object.

Set Operations

Python provides built-in methods for performing common set operations, such as intersection, union, difference, and symmetric difference.

  • Intersection: Finds the elements that two sets have in common.
  • Union: Finds all the elements from both sets.
  • Difference: Finds the elements present in one set, but not the other.
  • Symmetric Difference: Finds elements from both sets that are mutually not present in the other.

Conclusion

Sets are a versatile data structure that can be used for a variety of tasks, such as storing unique values, removing duplicates from a list, and combining data from multiple sources.

What are Sets?

Sets are a fundamental data structure in Python that store collections of unique, unordered elements. They are similar to lists in that they can store multiple values, but unlike lists, sets cannot contain duplicate elements. Additionally, sets are mutable, meaning that they can be changed after creation.

Creating Sets

There are two primary ways to create sets in Python:

  1. Using the set() function: The set() function takes an iterable as an argument and returns a new set object containing the unique elements from the iterable. For example:

Python

my_set = set([1, 2, 3, 4, 5])
print(my_set)  # Output: {1, 2, 3, 4, 5}
  2. Using curly braces: Curly braces can be used to create sets by enclosing the elements within the braces. For example:

Python

my_set = {1, 2, 3, 4, 5}
print(my_set)  # Output: {1, 2, 3, 4, 5}

Common Set Operations

Python provides built-in methods for performing common set operations, such as:

  1. Union: The union of two sets is the collection of all unique elements from both sets. It is represented by the | operator. For example:

Python

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 | set2
print(set3)  # Output: {1, 2, 3, 4, 5, 6}
  2. Intersection: The intersection of two sets is the collection of elements that are common to both sets. It is represented by the & operator. For example:

Python

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 & set2
print(set3)  # Output: set()
  3. Difference: The difference of two sets is the collection of elements that are in the first set but not in the second set. It is represented by the - operator. For example:

Python

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 - set2
print(set3)  # Output: {1, 2, 3}
  4. Symmetric Difference: The symmetric difference of two sets is the collection of elements that are in one set but not in the other set. It is represented by the ^ operator. For example:

Python

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 ^ set2
print(set3)  # Output: {1, 2, 3, 4, 5, 6}

Additional Set Methods

  1. add(element): Adds an element to the set.
  2. remove(element): Removes an element from the set.
  3. pop(): Removes and returns an arbitrary element from the set.
  4. clear(): Removes all elements from the set.
  5. isdisjoint(set2): Checks whether the set has no elements in common with the specified set.
  6. issubset(set2): Checks whether the set is a subset of the specified set.
  7. issuperset(set2): Checks whether the set is a superset of the specified set.
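
A short demonstration of several of these methods (the set contents are illustrative):

Python

s = {1, 2, 3}
s.add(4)                         # s is now {1, 2, 3, 4}
s.remove(1)                      # s is now {2, 3, 4}
print(s.issubset({2, 3, 4, 5}))  # Output: True
print(s.isdisjoint({9, 10}))     # Output: True
s.clear()                        # s is now set()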

Conclusion

Sets are a versatile and powerful data structure in Python that can be used for a variety of tasks, including storing unique values, removing duplicates from a list, and performing set operations. Understanding the basics of sets is essential for any Python programmer.

Welcome back! In this video,
we’re going to discover sets. A set is a data structure in Python that contains only unordered,
non-interchangeable elements. Sets are instantiated
with the set() function or non-empty braces. Each set element is unique and immutable. However, the set itself is mutable. Sets are valuable when storing mixed data in a single row, or a
record, in a data table. They’re also frequently used
when storing a lot of elements, and you want to be certain that each one is only present once. Because sets are mutable, they cannot be used as
keys in a dictionary. There are two ways to create a set. The first way is with the set function. The set function takes an
iterable as an argument and returns a new set object. Let’s examine the behavior of sets by passing lists, tuples, and strings through the set function. To turn the list containing “foo, bar,
baz, and foo” into a set, pass a list through the set function. And notice, the list loses the second foo. As I’ve mentioned, each
element must be unique in sets. Pass a tuple through the set function using two sets of parentheses:
one to tell the computer that the data we are
working with is a tuple; the other because the set function only takes a single argument. Again, the same result, only one foo element can
be present in the set. Finally, pass a string
through the set function. It doesn’t return the string, just the singular occurrence
of the letters in the string, O and F, in an unordered way. This is because the set function
accepts a single argument, and that argument must be iterable. A string is iterable, so
the set function splits it into individual characters and
keeps only the unique ones. The second way to instantiate
a set is with braces. However, you have to put
something inside the braces. Otherwise, the computer will
interpret your empty braces as a dictionary. Note here that instantiating
a set with braces treats what’s inside
the braces as literals. So when instantiating a set of only a single string using braces, it returns a set with a single element, and the element is the string itself. Remember, to define an
empty set or a new set, it is best to use the set function. You can only use curly braces
when the set is not empty and you are assigning
the set to a variable. Also, keep in mind that
because a set is unordered, it cannot be indexed or sliced. Now, let’s discuss some
additional functions you can use on sets. First, intersection finds the elements that two sets have in common. Union finds all the
elements from both sets. Difference finds the
elements present in one set, but not the other. And symmetric difference
finds elements from both sets that are mutually not
present in the other. Python provides built-in methods for performing each of these functions. Let’s start with the intersection method, denoted by the ampersand. First, define two sets. Then, apply the intersection function, either by attaching the
intersection method to set one, and passing set two to
the method’s argument, or by using the ampersand operator. Great. Now, let’s apply
the union function. Use braces to define the
two sets, X one and X two. The goal is to observe where they overlap. Print them using the union
method on the X two variable or the union operator symbol. Union is a commutative
operation in mathematics, so the combined values will be the same no matter what order you
put your variables in. The difference operation on sets, however, is not a commutative operation. Just like in math, if you
subtract five from seven, you get a different result than if you subtract seven from five. You can use either the difference method or the minus sign as a set operator. Subtracting set two from set one gives us only the elements in set one that are not shared with set two. But we don’t know if set
two contains any elements that are not shared with set one. The inverse is also true:
subtracting set one from set two gives us only the elements in set two that are not shared with set one. But we don’t know if set
one contains any elements that are not shared with set two. To get around this and
observe the difference between two sets mutually, use the symmetric difference function. As you might have guessed, you can use the symmetric
difference method or the symmetric difference
operator, expressed by a caret. Symmetric difference
outputs all the elements that the two sets do not share. Excellent work with sets in Python! You’ve learned so much in this section of the course already, and everything you’re
learning is preparing you for a really rewarding
career working with data. Can’t wait to be with you soon again.

In Python, what type of elements does a set contain? Select all that apply.

Unordered, Non-interchangeable

In Python, a set is a data structure that contains only unordered, non-interchangeable elements.

Reading: Reference guide: Sets

Reading

Lab: Exemplar: Dictionaries & sets

Practice Quiz: Test your knowledge: Dictionaries and sets

Fill in the blank: In Python, a dictionary’s _____ must be immutable.

In Python, what does the items() method retrieve?

A data professional is working with two Python sets. What function can they use to find all the elements from both sets?

Arrays and vectors with NumPy


Video: The power of packages

Python has many advanced features that can be used for data work and other scientific applications. These features are not included in basic Python, so it’s necessary to add them to your scripts.

  • Libraries, packages, and modules are reusable collections of code that provide additional functionality.
  • Libraries are often used interchangeably with packages.
  • Commonly used libraries for data work are matplotlib, seaborn, NumPy, and pandas.
  • Modules are accessed from within a package or a library.
  • Modules are used to organize functions, classes, and other data in a structured way.
  • Commonly used modules for data professional work are math and random.
  • There are several ways to import modules.

Video: Introduction to NumPy

NumPy is a powerful and dynamic Python library that contains multidimensional array and matrix data structures, as well as functions to manipulate them. Its power comes from vectorization, which enables operations to be performed on multiple components of a data object at the same time. This makes it particularly useful for data professionals who work with large quantities of data. NumPy is also efficient: vectorized code computes faster than traditional for loops. Vectors also take up less memory space, which is another important factor when working with a lot of data. In addition to being highly useful in its own right, NumPy powers a lot of other Python libraries, like pandas, so understanding how NumPy works will help you use these other packages.

Introduction to NumPy

NumPy is a Python library that provides a powerful and efficient way to work with arrays and matrices. It is an essential tool for data scientists, machine learning engineers, and anyone who works with numerical data.

What is NumPy?

NumPy stands for “Numerical Python”. It is a library that provides a variety of functions and data structures for working with numerical data. NumPy arrays are stored in memory in a way that makes them very efficient for operations such as arithmetic, sorting, and filtering.

Why use NumPy?

There are several reasons why NumPy is a popular choice for working with numerical data:

  • Speed: NumPy arrays are stored in memory in a way that makes them very efficient for operations such as arithmetic, sorting, and filtering. This can make your code run much faster, especially when you are working with large datasets.
  • Flexibility: NumPy supports a variety of data types, including integers, floats, strings, and complex numbers. It also supports a variety of operations, including arithmetic, sorting, filtering, and statistical functions.
  • Integration with other Python libraries: NumPy is well-integrated with other Python libraries, such as pandas and Matplotlib. This makes it easy to use NumPy with other tools for data analysis and visualization.

Getting started with NumPy

To get started with NumPy, you first need to install it. You can do this using the following command:

pip install numpy

Once NumPy is installed, you can import it into your Python scripts using the following import statement:

Python

import numpy as np

Creating NumPy arrays

There are several ways to create NumPy arrays. One way is to use the np.array() function. For example, the following code creates a NumPy array of integers:

Python

array = np.array([1, 2, 3, 4, 5])

You can also create NumPy arrays from lists, tuples, and other data structures. For example, the following code creates a NumPy array from a list of strings:

Python

array = np.array(['a', 'b', 'c', 'd', 'e'])

Working with NumPy arrays

Once you have created a NumPy array, you can work with it using a variety of methods and functions. For example, the following code prints the shape of the array:

Python

print(array.shape)

The shape attribute of a NumPy array is a tuple that contains the dimensions of the array. In this case, the shape of the array is (5,), which means that the array has one dimension and contains five elements.

You can also access individual elements of a NumPy array using indexing. For example, the following code prints the second element of the array:

Python

print(array[1])

This will print the value 2.

NumPy also supports a variety of operations, including arithmetic, sorting, filtering, and statistical functions. For example, the following code adds the number 10 to each element of the array:

Python

array += 10

This will modify the array so that it contains the values 11, 12, 13, 14, 15.

Conclusion

NumPy is a powerful and versatile library that is essential for anyone who works with numerical data. It provides a variety of functions and data structures that make it easy to work with arrays and matrices. If you are working with data science, machine learning, or any other field that involves numerical data, I encourage you to learn more about NumPy.

I hope this tutorial has been helpful. Please let me know if you have any questions.

Fill in the blank: In NumPy, _____ enables operations to be performed on multiple components of a data object at the same time.

vectorization

In NumPy, vectorization enables operations to be performed on multiple components of a data object at the same time. Data professionals often work with large datasets, and vectorized code helps them efficiently compute large quantities of data.

You’ve learned that
part of what makes Python such a powerful and dynamic
language are the packages and libraries that are available to it. One of the most widely used and
important of these is NumPy. Recall that NumPy contains
multidimensional array and matrix data structures, as well as functions to manipulate them. You’ll learn more about
these data structures and functions soon, but for
now, let’s just consider NumPy at a high level and learn more about it. NumPy’s power comes from vectorization. Vectorization enables
operations to be performed on multiple components of a
data object at the same time. This is particularly useful
for data professionals because their jobs often involve working with very large quantities of data, and vectorized code saves a lot of time because it computes more efficiently. Let’s explore this a little more. Suppose we have list A and
list B, both the same length, and we want to create a new list C that’s the element-wise
product of each list. In other words, we want to
multiply the first element of list A by the first element of list B, then multiply the second element of list A by the second element
of list B, et cetera. If we try to multiply
lists A and B together, the computer throws an error. To perform this operation,
we could write a for loop. We’d start by defining
an empty list, list C, then make a range of indices to loop over, and append the product
of list A and list B at each index to list C. This gets the job done,
but if you’re thinking, “There’s gotta be an
easier way,” you’re right! We can use NumPy to perform this operation as a vectorized computation. Simply convert each list to a NumPy array and multiply the two arrays together using the product operator. The results are the same, but the vectorized approach
is simpler, easier to read, and faster to execute because, while loops iterate over
one element at a time, vector operations compute simultaneously in a single statement. The efficiency of this
might not be noticeable now, but when working with
large datasets it will be. Vectors also take up less memory space, which is another factor
that becomes important when working with a lot of data. You might have noticed
that when we used NumPy, we first had to import it. This is called an import statement. An import statement
uses the import keyword to load an external
library, package, module, or function into your
computing environment. Once you import something
into your notebook and run that cell, you don’t
need to import it again unless you restart your notebook. When we import NumPy, we import it as NP. This is known as aliasing. Aliasing lets you assign an
alternate name, or alias, by which you can refer to something. In this case, we’re
abbreviating NumPy to NP. Notice the NPs in the code
below the import statement where we create our arrays. If we didn’t give NumPy an alias of NP, we’d have to type out “numpy”
here in order to access its array function. Aliasing as NP makes the code
shorter and easier to read. Note that NP is the standard alias. If you use something else, other people might get confused
when reading your code. In addition to being highly
useful in its own right, NumPy powers a lot of
other Python libraries, like pandas, so understanding
how NumPy works will help you use these other packages. There’s a lot more to
discover about NumPy. Coming up, you’ll learn about
its core data structures and functionalities. See you soon.
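
Here is a minimal sketch of the element-wise multiplication described in this video, first with a for loop and then vectorized with NumPy; the list values are illustrative:

Python

import numpy as np

list_a = [1, 2, 3, 4]
list_b = [5, 6, 7, 8]

# For-loop approach: append the product at each index to a new list
list_c = []
for i in range(len(list_a)):
    list_c.append(list_a[i] * list_b[i])
print(list_c)  # Output: [5, 12, 21, 32]

# Vectorized approach: convert the lists to arrays and multiply in one statement
array_a = np.array(list_a)
array_b = np.array(list_b)
print(array_a * array_b)  # Output: [ 5 12 21 32]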

The Power of Packages in Python

Python is a versatile and powerful programming language that has gained immense popularity in recent years. One of the key factors contributing to Python’s success is its extensive ecosystem of packages. Packages are collections of pre-written code that provide a wide range of functionality, making it easier and more efficient to develop Python applications.

What are Packages?

In Python, a package is a collection of Python modules that are organized together under a single namespace. Packages provide a structured way to manage and share code, making it easier to reuse and extend existing code. They also encapsulate specialized functionality into modular units, promoting code reusability and maintainability.

Why Use Packages?

There are several compelling reasons to use packages in Python development:

  1. Code Reusability: Packages eliminate the need to re-invent the wheel, allowing developers to leverage existing code written by others. This saves time and effort, enabling developers to focus on the core logic of their applications rather than spending time on basic tasks.
  2. Modular Development: Packages promote modular development, breaking down complex applications into smaller, manageable modules. This modular approach enhances code organization, making it easier to understand, maintain, and debug.
  3. Specialization: Packages encapsulate specialized functionality into modular units, allowing developers to access and utilize specific features without having to understand the underlying implementation details. This promotes code reusability and reduces the learning curve for developers.
  4. Community Support: Many popular packages have active communities of developers who contribute bug fixes, updates, and new features. This ensures long-term maintenance and support for the package.
  5. Standardization: Packages promote standardization by providing consistent and well-documented code structures. This makes it easier for developers to collaborate and understand each other’s code.

How to Install and Use Packages

Installing and using packages in Python is straightforward. The pip tool, which is included with Python, is used to manage package installation. To install a package, simply run the following command in your terminal:

Bash

pip install <package-name>

Once a package is installed, you can import its modules into your Python scripts using the import statement:

Python

import <package-name>

This will make the package’s modules available for use in your script. For example, to use the pandas package for data manipulation, you would import it as follows:

Python

import pandas as pd

Popular Python Packages

The Python ecosystem boasts a vast collection of packages catering to various domains and applications. Here are some of the most popular and widely used packages:

  • NumPy: A fundamental library for scientific computing, providing efficient numerical operations on arrays and matrices.
  • pandas: A powerful data analysis library built on NumPy, providing tools for data manipulation, cleaning, and analysis.
  • Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations.
  • Seaborn: A data visualization library built on Matplotlib, providing a higher-level interface for creating common plots and graphs.
  • scikit-learn: A machine learning library providing a wide range of algorithms and tools for supervised and unsupervised learning.
  • requests: An HTTP library for making web requests and interacting with APIs.
  • BeautifulSoup: A library for parsing and extracting data from HTML documents.
  • Flask: A web framework for building web applications and APIs.
  • Django: A powerful web framework for building complex web applications with a robust architecture.

Conclusion

Packages play a pivotal role in Python development, providing a wealth of pre-written code and functionality. By leveraging packages, developers can streamline the development process, enhance code reusability, and access specialized tools for various tasks. Whether you’re building data analysis applications, web applications, or machine learning models, packages are essential for efficient and effective Python development.

Python has many advanced
calculation capabilities that are used for data work and other scientific applications. These features make it possible to extend, enhance, and reuse parts of the code. To access these features, you can import them from libraries, packages, and modules. These features are not
included in basic Python, so it’s necessary to add
them to your scripts. The additional functionality
can save you time constructing functions and
objects in your own work. Using these features can also help you obtain extra data types for analyzing data or building machine learning models. Let’s start with libraries. A library, or package, broadly refers to a reusable collection of code. It also contains related
modules and documentation. You’ll often encounter the terms library and package used interchangeably. Commonly used libraries for data work are matplotlib and seaborn. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Seaborn is a data visualization library that’s based on matplotlib. It provides a simpler interface for working with common plots and graphs. This certificate program integrates two other commonly used
libraries for data work: NumPy and pandas. NumPy, or Numerical Python, is an essential library that contains multidimensional array
and matrix data structures and functions to manipulate them. This library is used for
scientific computation. And pandas, which stands
for Python Data Analysis, is a powerful library
built on top of NumPy that’s used to manipulate
and analyze tabular data. There are many other
popular Python libraries and packages for data professional work, such as scikit-learn,
statsmodels, and others. Scikit-learn, a library, and statsmodels, a package, consist of
functions data professionals can use to test the performance of statistical models. They’re used across
various scientific fields. Scikit-learn and statsmodels
are pretty advanced, so you won’t be working
with them in this course, but you’ll have opportunities to work with these libraries
elsewhere in the program. Again, different
practitioners across the field often conflate libraries and packages, so you may hear them referred to as one, the other, or both. Libraries and packages
provide sets of modules that are essential for data professionals. Modules are accessed from
within a package or a library. They are Python files
that contain collections of functions and global variables. Global variables differ
from other variables because these variables can be accessed from anywhere in a program or script. Modules are used to organize functions, classes, and other data
in a structured way. Internally, modules are set up through separate files that contain these necessary classes and functions. When you import a module, you are using pre-written code components. Each module is an executable file that can be added to your scripts. Commonly used modules for
data professional work are math and random. Math provides access to
mathematical functions. And random is used to
generate random numbers. This is useful when selecting random elements from a list; shuffling elements randomly; or working with random sampling, which you’ll explore in a later course. There are several ways to import modules, depending on whether you want to use the whole package or just a single, pre-defined function or feature. This adds functionality for carrying out specialized operations. There’s lots more to
learn about libraries, packages, and modules, so feel free to refer
to the course resources for more information on installing these features and to continue growing your Python knowledge. But as a reminder, you don’t
have to install anything, because everything you need to complete the different sections of this certificate program are already built into the notebooks you’ll be using in Coursera. I’ll introduce you to some
libraries in the next video.
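
As one illustration of the different ways to import, here is a minimal sketch using the math and random modules mentioned in this video:

Python

# Import a whole module, then access its contents with dot notation
import math
print(math.sqrt(16))  # Output: 4.0

# Import a single function directly from a module
from math import sqrt
print(sqrt(16))  # Output: 4.0

# Import a module under an alias
import random as rnd
print(rnd.randint(1, 6))  # A random integer from 1 to 6, inclusive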

Reading: Understand Python libraries, packages, and modules

Reading

Video: Basic array operations

In this tutorial, the speaker introduces the NumPy library and its significance in making data manipulation faster and more efficient through vectorization. The core data structure of NumPy, known as an “n-dimensional array” or ndarray, is highlighted. The tutorial covers the creation of ndarrays from Python objects, emphasizing that they are mutable, so the values they contain can be changed.

The importance of maintaining a consistent data type within an array is discussed, and the consequences of mixing data types are demonstrated. The tutorial also covers the use of the dtype attribute to check the data type of array contents. The concept of multidimensional arrays is introduced, including one-dimensional, two-dimensional, and three-dimensional arrays.

The shape and ndim attributes are explained as tools to confirm the structure and number of dimensions of an array. The tutorial touches on reshaping arrays using the reshape method, emphasizing its relevance in data analysis tasks. The speaker briefly mentions the vast capabilities of NumPy, including mathematical and statistical operations, and the importance of understanding NumPy basics for working with libraries like Pandas in data analysis.

The tutorial concludes by highlighting NumPy’s integral role in advanced data analysis and its frequent use in conjunction with other libraries and packages.

Basic Array Operations in Python

Arrays are a fundamental data structure used to store and manipulate collections of data. In Python, the built-in list often serves this purpose, and the NumPy library provides true arrays. Both offer an efficient way to represent and work with large datasets compared to individual variables. This tutorial will introduce you to basic array operations in Python, equipping you to handle data with ease.

1. Creating Arrays:

There are several ways to create arrays in Python:

  • Using list:

Python

my_list = [1, 2, 3, 4, 5]
print(my_list) # Output: [1, 2, 3, 4, 5]
  • Using numpy.array:

Python

import numpy as np
my_array = np.array([1, 2, 3, 4, 5])
print(my_array) # Output: [1 2 3 4 5]
  • Using built-in functions:

Python

my_range = range(1, 6)
print(my_range) # Output: range(1, 6)

my_zeros = np.zeros(5)
print(my_zeros) # Output: [0. 0. 0. 0. 0.]

2. Accessing Elements:

Elements in an array can be accessed using their index (position) within square brackets:

Python

print(my_list[2]) # Output: 3

# Negative indexing starts from the end
print(my_list[-1]) # Output: 5

# Accessing sub-arrays
print(my_list[1:3]) # Output: [2, 3]

3. Modifying Arrays:

You can modify existing elements or add new ones using assignment:

Python

my_list[0] = 10
print(my_list) # Output: [10, 2, 3, 4, 5]

my_list.append(6)
print(my_list) # Output: [10, 2, 3, 4, 5, 6]

# Note: Python lists can grow with append(); NumPy arrays have a fixed size, so resizing one requires reassignment

4. Array Operations:

Python provides various operators for working with arrays:

  • Arithmetic operations: Addition, subtraction, multiplication, division (element-wise)
  • Comparison operators: Less than, greater than, equal to, etc. (element-wise)
  • Logical operators: AND, OR, NOT (element-wise)

Python

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b) # Output: [5 7 9]
print(a > b) # Output: [False False False]
print(np.all(a > b)) # Output: False

5. Useful Functions:

  • len(array): Returns the number of elements.
  • np.sum(array): Calculates the sum of all elements.
  • np.mean(array): Computes the average value.
  • np.sort(array): Sorts the elements in ascending order.

These are just a few basic operations. As you learn more about Python, you’ll discover more advanced features and libraries like NumPy that offer extensive array manipulation capabilities.

Remember, practice is key! Start exploring and experimenting with these operations on your own data sets. The more you work with arrays, the more comfortable and efficient you’ll become in handling data analysis tasks.

I hope this tutorial provides a solid foundation for your journey into the world of arrays in Python! Feel free to ask if you have any questions.

Welcome back. You’ve recently learned that the NumPy library uses vectorization to make working with data
faster and more efficient, which makes your job easier. I demonstrated how NumPy performed an element-wise
multiplication of two lists by converting the lists to arrays and then simply multiplying them together. Now, we’re going to continue
learning about arrays and how to work with them. The array is the core
data structure of NumPy. The data object itself is known as an “n-dimensional
array,” or ndarray for short. The ndarray is a vector. Recall that vectors enable many operations to be performed together
when the code is executed, resulting in faster run-times that require less computer memory. You can create an ndarray
from a Python object by passing the object to
the NumPy array function. ndarrays are mutable, so you can change the values they contain. If I want to change this last
value from a four to a five, I can do that by identifying
the index number. Since we’re dealing with
the last value, here, it’s necessary to use a negative one. But, I can’t change the size of the array without reassigning it. If I try to add a number
to the end of this array, the computer throws an error. So if you want to change
the size of an array, you have to reassign it. Another requirement of the array is that all of its elements
be of the same data type. If I create an array with
the integers one, two, and then a string of “coconut,” NumPy will create an array that forces everything to the
same data type, if possible. In this case, everything becomes a string, represented here by
“U21,” meaning unicode 21. So be careful when creating your arrays that they all contain
data of the same type, or, if they don’t, that
this is intentional and useful to what you’re doing. You previously learned that calling the “type”
function on an object will return the data type of the object. If we do that with an array, as you might expect, we get a NumPy array. We can use the dtype attribute if we want to check the data type of the contents of an array. Here, the dtype attribute indicates that this array consists of integers. As the name implies, ndarrays
can be multidimensional. For a one-dimensional array, NumPy takes an array-like
object of length X, like a list, and creates an ndarray in the shape of X. A one-dimensional array is
neither a row nor a column. We can use the shape attribute to confirm the shape of an array. We can also use the ndim attribute to confirm the number of
dimensions the array has. Data professionals will
often need to confirm the shape and number of
dimensions of their array. For example, if they’re
trying to attach it to another existing array. These methods are also commonly used to help understand what’s going wrong when your code throws an error. A two-dimensional array can be
created from a list of lists, where each internal
list is the same length. You can think of these internal
lists as individual rows, so the final array is like a table. Notice that this array has a shape of four rows by two columns and is two dimensions. If a two-dimensional
array is a list of lists, then a three-dimensional array is a list that contains two of these, so a list of two lists of lists. This array can be
thought of as two tables, each with two rows and three columns. Thus, it has three dimensions. This can go on indefinitely. Thankfully, there are ways to help simplify working
with multidimensional arrays, which you’ll learn about later. And unless you’re doing very advanced scientific computations, you typically won’t be working
directly with NumPy arrays that are more than three dimensions. NumPy also lets us reshape an array using the reshape method. Our two-dimensional array was
four rows and two columns. But what if we wanted this data to be two rows by four columns? We just plug these values
into the reshape method and reassign the result back
to the array 2D variable. Reshaping data is a common
task in data analysis, so it’s important for you to be familiar with what it means and how it works. There are many other operations that can be performed with arrays, and you’ll surely learn those as the need for them
arises in your projects. But there are other helpful
functions and methods in NumPy that you’ll use regularly. These include functions to
calculate things like mean, and natural logarithm, and floor and ceiling operations, which round numbers to the nearest lesser and greater whole number, respectively. And many other frequently used mathematical and statistical operations. NumPy is very robust. There are so many things
you can do with it that we can only briefly
consider them here. As you know, NumPy powers many other useful libraries and packages. In this certificate program, we won’t be working a
lot directly with NumPy, but we will be working
a lot with a library that depends on it: Pandas. It’s important that you
understand the basics of NumPy because it will help you when
you start working with Pandas. As you develop your skills
as a data professional, you’ll find yourself returning
to NumPy time and time again because it’s such an integral part of advanced data analysis. Bye for now.
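
Here is a minimal sketch of the array behaviors walked through in this video; the values are illustrative:

Python

import numpy as np

arr = np.array([1, 2, 3, 4])
arr[-1] = 5          # ndarrays are mutable: change the last value by index
print(arr)           # Output: [1 2 3 5]
print(arr.dtype)     # Output: int64 (the exact integer type may vary by platform)

# Mixed types are coerced to a single type where possible
mixed = np.array([1, 2, 'coconut'])
print(mixed.dtype)   # Output: <U21 -- every element became a Unicode string

# Shape, number of dimensions, and reshaping
array_2d = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print(array_2d.shape)  # Output: (4, 2) -- four rows, two columns
print(array_2d.ndim)   # Output: 2
array_2d = array_2d.reshape(2, 4)
print(array_2d.shape)  # Output: (2, 4)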

Reading: Reference guide: Arrays

Reading

Exemplar: Arrays and vectors with NumPy

Practice Quiz: Test your knowledge: Arrays and vectors with NumPy

Python libraries and packages include which of the following features? Select all that apply.

What is the core data structure of NumPy?

A data professional wants to confirm the datatype of the contents of array x. How would they do this?

Dataframes with pandas


Video: Introduction to pandas

Main Points:

  • Pandas is a powerful Python library for data manipulation and analysis.
  • It provides a convenient interface for working with tabular data.
  • Pandas allows for easy data loading from various formats (CSV, Excel, etc.).
  • Key functionalities include:
    • Data manipulation and filtering.
    • Statistical analysis (mean, min, max, standard deviation, etc.).
    • Data grouping and aggregations.
    • Custom calculations.
  • Pandas simplifies data analysis tasks and provides intuitive visualization.

Key Takeaways:

  • Pandas is an essential tool for data professionals and data analysis tasks.
  • Its user-friendly interface makes it easier to work with data than NumPy.
  • Pandas offers a wide range of functionalities for efficient data manipulation and analysis.
  • Learning Pandas opens doors to powerful data exploration and insights.

Additional Notes:

  • Dataframe is the core data structure in Pandas, representing tables with rows and columns.
  • Pandas supports various data types like integers, floats, strings, and booleans.
  • Filtering allows focusing on specific subsets of data based on simple or complex logic.
  • Pandas enables manipulation of data, such as adding calculated columns.
  • Grouping and aggregation facilitate analysis of data subsets based on specific criteria.

What is Pandas?

Pandas is a powerful and popular Python library for data analysis and manipulation. It provides efficient data structures and functions for working with tabular data, commonly known as spreadsheets or tables with rows and columns. Think of it as a toolbox specifically designed to handle and analyze your data with ease.

Why Use Pandas?

There are several reasons why Pandas is a preferred choice for data analysis in Python:

  • Simplicity: Pandas offers a user-friendly interface that makes it easier to comprehend and work with data compared to lower-level libraries like NumPy.
  • Efficiency: Pandas provides optimized data structures and algorithms for efficient data handling, leading to faster processing and analysis.
  • Functionality: Pandas is packed with various functionalities for data manipulation, analysis, and visualization.
  • Versatility: Pandas supports loading data from various formats like CSV, Excel, databases, and more.
  • Popularity: Pandas is a widely used and well-supported library, providing a vast community for support and learning resources.

Getting Started with Pandas

1. Installation:

Ensure you have Python installed and run the following command to install Pandas:

Bash

pip install pandas

2. Import Pandas:

In your Python script, import Pandas using the following syntax:

Python

import pandas as pd

3. Loading Data:

Pandas offers various functions for loading data from different sources. Here are some common examples:

Python

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Load data from an Excel spreadsheet
data = pd.read_excel("data.xlsx")

# Load data from a URL
data = pd.read_csv("https://example.com/data.csv")

4. Exploring Data:

Once you have loaded your data into a Pandas dataframe, you can explore it using various methods:

  • Accessing specific data: Use indexing and slicing to access rows, columns, or individual cells.
  • Descriptive statistics: Calculate summary statistics like mean, standard deviation, minimum, and maximum values.
  • Data filtering: Select subsets of data based on specific conditions and criteria.
  • Data sorting: Sort data based on specific columns in ascending or descending order.

5. Data Manipulation:

Pandas allows you to modify and manipulate your data:

  • Adding and removing columns: Add new columns based on calculations or logic, or remove unnecessary columns.
  • Editing data: Modify existing values in your dataframe.
  • Merging and joining dataframes: Combine data from multiple sources.

6. Data Visualization:

Pandas offers built-in functions for visualizing your data:

  • Series plots: Create line plots, bar charts, etc., for single data series.
  • DataFrame plots: Generate scatter plots, heatmaps, boxplots, and more.


Next Steps:

This tutorial provides a brief introduction to Pandas. As you progress, you’ll delve deeper into its functionalities, explore advanced data analysis techniques, and master the art of data manipulation and visualization with Pandas.

Hi there. We previously discussed NumPy, and how it’s an important tool for data professionals and anyone else whose job requires high
performance computational power. We also investigated how other libraries and packages use NumPy because of the efficiencies
that come with vectorization. One of these libraries is pandas, a quintessential tool both
in this certificate program and in the world of data analytics. In this lesson, you’re going
to learn more about pandas and why it’s so useful. Because pandas is a library that adds functionality
to Python’s core toolset, you have to import it. Similar to how we imported NumPy as NP, pandas has its own standard alias of PD. Typically, when using pandas, you import both NumPy and pandas together. This is just for convenience, given that NumPy is often used
in conjunction with pandas. Strictly speaking, you don’t have to import
NumPy to work in pandas. Pandas is fully operational on its own. Pandas’ key functionality is the manipulation and
analysis of tabular data – that is, data that’s in the form of a table, with rows and columns. A spreadsheet is a common
example of tabular data. While NumPy is capable of many of the same functions
and operations as pandas, it’s not always as easy to work with because it requires you to work more abstractly with the data and keep track of what’s being done to it, even if you can’t see it. Pandas, on the other hand, provides a simple
interface that allows you to display your data as rows and columns. This means that you can
always follow exactly what’s happening to your
data as you manipulate it. In this video, I’ll give you
a demonstration of pandas, and what it’s like to use it. Later, we’ll go into greater detail on its unique classes,
processes, and functions. First of all, you can load
data into pandas easily from different formats like comma-separated value files, or CSVs, Excel, and other spreadsheets,
databases, and more. Here, I’m loading a CSV file that I’m accessing via a web URL. The file contains information for some of the passengers
from the Titanic, including their names,
what class ticket they had, their age, ticket price, and cabin number. By the way, this table of
data is called a dataframe. The dataframe is a core
data structure in pandas. Notice that the dataframe is
made up of rows and columns, and it can contain data of
many different data types including integers, floats,
strings, booleans, and more. If I want to calculate the
average age of the passengers, we do so by selecting the age column and calling the mean method on it. I can also get the max, min, and standard deviation
with minimal effort. I can also quickly check how many passengers were in each class. Checking summary statistics
of the entire dataset only requires one line of code. This method gives me the number of rows as well as the mean, standard deviation, minimum and maximum values, along with the quartiles
for every numeric column. These concepts are all covered in greater depth elsewhere in the program. For now, I just want you
to pay close attention to the power of pandas and all that you can accomplish with it. Pandas also allows me to filter based on simple or complex logic. For example, here I’m selecting only the third class passengers
who were older than 60. In addition to all of
these data analysis tools, pandas also gives us ways to
manipulate and change the data. For example, I can add
a column that represents the inflation adjusted price
of a ticket from 1912 to 2023. Florence Briggs Thayer paid 71.28 pounds for her first class ticket. Today, that ticket would have cost her 10,417 pounds sterling. If you’re wondering
how I knew her name was Florence Briggs Thayer, it’s because I can also
select rows, columns, or individual cells from
the data using indexing. Her name is in row one, column three. I can also do more complex data
groupings and aggregations. For example, here I’m grouping the
passengers by class and sex, and then calculating the mean cost of a ticket for each group. Hopefully you’re excited to
start working with pandas. I know I’m looking forward to
guiding you as you learn more about this powerful and
fun data analysis tool.
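
Here is a minimal sketch of the kind of exploration shown in this video. It assumes a Titanic CSV is available at the hypothetical path "titanic.csv" and uses the column names common to that dataset (Age, Pclass, Sex, Fare):

Python

import pandas as pd

# "titanic.csv" is a hypothetical path; read_csv also accepts URLs
titanic = pd.read_csv('titanic.csv')

print(titanic['Age'].mean())             # Average passenger age
print(titanic['Age'].max())              # Oldest passenger
print(titanic['Pclass'].value_counts())  # Number of passengers in each class
print(titanic.describe())                # Summary statistics for numeric columns

# Filter with simple or complex logic: third-class passengers older than 60
print(titanic[(titanic['Pclass'] == 3) & (titanic['Age'] > 60)])

# Group and aggregate: mean ticket price by class and sex
print(titanic.groupby(['Pclass', 'Sex'])['Fare'].mean())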

Video: pandas basics

Main Points:

  • Pandas is a powerful library for data analysis and manipulation.
  • It has two main object classes: DataFrames and Series.
  • DataFrames are two-dimensional, labeled data structures with rows and columns.
  • Series are one-dimensional, labeled arrays.
  • DataFrames can be created from various data sources like dictionaries, NumPy arrays, and CSV files.
  • They have many useful methods and attributes for accessing and manipulating data.
  • Selecting and referencing parts of a DataFrame is done using bracket notation or dot notation.
  • iloc is used for integer-location-based selection, while loc is used for selection by label.
  • New columns can be added to a DataFrame using simple assignment statements.

Key Takeaways:

  • DataFrames are the primary data structure in pandas.
  • Series are useful for representing individual columns or rows.
  • Pandas offers various methods for creating and manipulating DataFrames.
  • Indexing and slicing are used to select parts of a DataFrame.
  • iloc and loc provide different ways for selecting based on location or label.
  • DataFrames are flexible and powerful tools for data analysis.

Additional Notes:

  • NaN represents null values in pandas.
  • Series with mixed data types are classified as “object” type.
  • Dot notation is preferred for simple selections, while brackets are better for complex code.
  • The pandas documentation is a valuable resource for learning more about its features and capabilities.

Conclusion:

This tutorial provides a solid introduction to pandas DataFrames and Series. It covers basic creation, manipulation, and selection techniques. By understanding these key concepts, you can begin working with pandas effectively and explore its vast potential for data analysis. Remember, the documentation is always available to guide you as you advance your pandas skills.

Introduction:

Welcome to the exciting world of pandas! This tutorial will guide you through the essential concepts of pandas, equipping you with the foundational knowledge to work effectively with data in Python.

1. What is Pandas?

Pandas is a powerful and popular open-source library in Python specifically designed for data analysis and manipulation. It offers a wide range of functionalities, including:

  • Data structures: Pandas provides efficient data structures like DataFrames and Series for organizing and managing tabular data.
  • Data manipulation: Powerful methods and functionalities for cleaning, transforming, and analyzing data.
  • Data analysis: Tools for statistical analysis, visualization, and exploration.

2. Core Data Structures:

  • DataFrame: A two-dimensional, labeled data structure with rows and columns. Think of it as a spreadsheet or a SQL table.
  • Series: A one-dimensional, labeled array. Often used to represent individual columns or rows of a DataFrame.

3. Creating DataFrames:

There are several ways to create DataFrames:

  • From dictionaries: Keys become column names and values become list elements in each column.
  • From NumPy arrays: Arrays are converted to DataFrames with rows and columns.
  • From CSV files: Pandas provides the read_csv function to import data from CSV files.
  • From other data sources: Pandas supports various data sources like JSON, Excel files, and databases.

4. Accessing Data:

  • Column selection: Use bracket notation with column names or dot notation (for simple names).
  • Row selection: Use integer-based indexing (iloc) or label-based indexing (loc).
  • Accessing specific values: Use bracket notation with two indices (row and column).

5. Basic Operations:

  • Data cleaning: Handling missing values, formatting data types, and removing outliers.
  • Data manipulation: Filtering, sorting, merging, and joining DataFrames.
  • Data analysis: Descriptive statistics, group-by operations, and time series analysis.

6. Essential Methods:

  • head(): Shows the first few rows of a DataFrame.
  • tail(): Shows the last few rows of a DataFrame.
  • info(): Displays information about the DataFrame, including data types and null values.
  • describe(): Provides descriptive statistics for numerical columns.
  • sort_values(): Sorts a DataFrame by a specific column.
  • groupby(): Groups data by a specific column and performs operations on each group.

7. Visualization:

Pandas integrates seamlessly with matplotlib and seaborn for data visualization. You can create various charts and graphs to explore and understand your data.

8. Conclusion:

This tutorial serves as a starting point for your journey with pandas. As you practice and explore, you will discover its immense capabilities and become a proficient data analyst. Remember, the pandas documentation is your valuable resource for learning more about specific functionalities and advanced techniques.


Next Steps:

  • Practice the concepts covered in this tutorial with real data sets.
  • Explore more advanced pandas functionalities like data aggregation, time series analysis, and plotting.
  • Learn about other libraries like matplotlib and seaborn for data visualization.
  • Participate in online communities and forums to connect with other data enthusiasts and learn from their experiences.

Remember, the key to mastering pandas is continuous practice and exploration. So, keep learning, keep coding, and keep analyzing data!

Fill in the blank: In pandas, a dataframe is a _____-dimensional, labeled data structure.

two

In pandas, a dataframe is a two-dimensional, labeled data structure. A dataframe is organized into rows and columns.

Now that you have a good understanding of the core structures
and routines of Python, and some of the basics of NumPy, you’re ready to start working with pandas. Pandas is one of the primary
tools that you’ll use throughout the rest of
this certificate program, as well as in a large and growing number of data professions. In this video, you’ll learn about the main classes
in the pandas library and some important ways to work with them. Pandas has two core object
classes: dataframes and series. Let’s begin with a review of dataframes. A dataframe is a two-dimensional, labeled data structure with rows and columns. You can think of a dataframe like a spreadsheet or a SQL table. It can contain many
different kinds of data. Data professionals use dataframes
to structure, manipulate, and analyze data in pandas, just like we did in the previous video with the Titanic example. We can create a dataframe using the pandas DataFrame function. This function has a lot of flexibility, and can convert numerous data formats to a DataFrame object. In this example, we created a
dataframe from a dictionary, where each key of the dictionary
represents a column name, and the values for that key are in a list. Each element in the
list represents a value for a different row at that column. We can also create one from a NumPy array resembling a list of lists, where each sub-list
represents a row of the table. Notice that in this example, we included separate keyword arguments for columns and index. This approach lets us name the columns and rows of the dataframe. These are just a couple
of the many different ways to create a dataframe with
the DataFrame function. For examples of some
others, be sure to review the available pandas
documentation on this topic. Often, data professionals
need to be able to create a dataframe from existing data that’s not written in Python syntax. For example, maybe we want to take an existing spreadsheet and
manipulate it in pandas. Spreadsheets can be saved as CSV files, which can then be read
into pandas as a dataframe. CSV stands for comma-separated values, and it refers to a plaintext file that uses commas to separate distinct values from one another. Here is a sample of the first
few lines of source data from the Titanic dataset
that we used previously. This is what a CSV file looks like. In this file, you’ll find
values for passenger name, age, sex, fare, and more. Notice that a comma is used to separate each value from the next. To create a dataframe from a CSV file, pandas has the “read CSV” function. Here’s the same Titanic data
rendered as a dataframe. For the sake of an example,
it’s defined here as df3. The “read CSV” function
can read files from a URL, like in this example, and
it can also read files directly from your hard drive. Instead of a URL, you’d just provide the file path to your file. Now, let’s discuss the other
main class in pandas: Series. A Series is a one-dimensional,
labeled array. Series objects are most often used to represent individual
columns or rows of a dataframe. So, if we select a row or a column from this Titanic dataframe
and call “type” on it, it will return as a pandas series object. Like dataframes, individual series can be created from various data objects, including from NumPy arrays, dictionaries, and even scalars. Again, refer to the pandas
documentation for examples. Now, let’s use the Titanic dataset to review some of the basics of working with dataframes and series. The DataFrame and Series classes have many super useful
methods and attributes that make common tasks easier. Remember, a method is a function
that belongs to a class. It performs an action on the object. An attribute is a value
associated with a class instance. It typically represents a
characteristic of the instance. Both methods and attributes are
accessed using dot notation, but methods use parentheses,
while attributes do not. Earlier in the video, we
named the Titanic dataset “df3,” but let’s change the name
to “titanic” for clarity. We can do this by simple reassignment. If we want to access the
“columns” of the dataframe, we can use the columns attribute. This returns an index of
all of the column names. We can use the shape attribute to check the number of rows and columns
contained in the dataframe. This dataframe has 891
rows and 12 columns. And we can get some summary
information about the dataframe by calling the info method. This tells us that there
are 891 rows and 12 columns, and it also gives us the column names, the data type contained in each column, the number of non-null
values in each column, and the amount of memory
the dataframe uses. By the way, I want to
address a couple of points about terminology in pandas. First, null values in pandas
are represented by NaN, which stands for “not a number.” And second, if a Series object contains mixed or string data types, when you check its data type, it will come back as an “object.” This is an example of how
pandas is built on NumPy, but the details of this are
beyond the scope of this video. One of the most common
tasks when working in pandas is selecting or referencing
parts of the dataframe. This has many similarities
with indexing and slicing. For example, if you want
to select a single column, you can type the name of the dataframe followed by brackets,
and within the brackets enter the name of the column as a string. This returns a Series
object of that column. You can also use dot
notation, but this only works if the column name does not
contain any whitespaces. Using dot notation is faster to type, so for very simple lines of
code, you may prefer to do this. But if the code begins
to get more complex, it’s generally better
to use bracket notation, because it makes the code easier to read. To select multiple columns
of a dataframe by name, use bracket notation. Within the brackets, enter
a list of column names. This returns a view of your dataframe as a new DataFrame object. If you want to select
rows or columns by index, you’ll need to use iloc. iloc is a way to indicate in pandas that you want to select by
integer-location-based position. If you enter a single integer
into the iloc brackets, you’ll get a series object representing a single row of your
dataframe at that index. Because I entered 0 here, I got the very first row in
my dataframe as a series. If you enter a LIST of a single integer in the iloc brackets, you’ll
get a DataFrame object of a single row of the
dataframe at that index. You can access a range of rows by entering the indices of
the beginning and ending rows separated by a colon. Pandas will return every index starting with the beginning index up to, but not including, the last index. So zero colon three returns row indices 0, 1, and 2. You can select subsets of rows
and columns together, too. This returns a dataframe
view of rows 0, 1, and 2 at columns 3 and 4 only. So, if you want a single
column in its entirety, you select all rows, and then
enter the index of the column you want. And, you can
even get a single value at a particular row in a particular column by using two indices separated by a comma. Loc is similar to iloc,
but instead of selecting by index location, loc is used to select pandas rows and columns by name. Let’s investigate loc with
the Titanic dataframe… In this example, I’m
selecting rows 1, 2, and 3 at just the “Name” column. Note that in this example,
we’re referring to the rows with numbers, even though
we’re using loc to select. This is because our rows
are indexed by number. If we had a named index, however, we’d have to use row names, like what we’re doing for columns. And one more thing. If you want to add a new
column to a dataframe, you can do that with a
simple assignment statement. Now we have a new column at the end here. There are so many things
you can do in pandas, and I can only share so
much in the time we have. As always, the documentation
is your friend. There will inevitably be times where you need to do something that wasn’t explicitly covered here. In those cases, the
documentation almost always has simple examples that demonstrate how to do the thing you need to do. There’s still more to come though, so I’ll meet you soon.
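
Here is a minimal sketch of the creation and selection techniques from this video; the data values are illustrative:

Python

import pandas as pd

# Create a dataframe from a dictionary: keys become column names
df = pd.DataFrame({'name': ['Ada', 'Ben', 'Cal'],
                   'age':  [25, 30, 35]})

print(df.columns)  # Index(['name', 'age'], dtype='object')
print(df.shape)    # Output: (3, 2) -- three rows, two columns
df.info()          # Column dtypes, non-null counts, and memory usage

print(df['name'])           # A single column as a Series (df.name also works here)
print(df[['name', 'age']])  # Multiple columns as a new DataFrame view

print(df.iloc[0])           # First row by integer position, as a Series
print(df.iloc[0:2, 1])      # Rows 0-1 at the column in position 1 ('age')
print(df.loc[0:2, 'name'])  # Rows labeled 0 through 2 (inclusive) at the 'name' column

# Add a new column with a simple assignment statement
df['age_in_months'] = df['age'] * 12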

Reading: The fundamentals of pandas

Reading

Video: Boolean masking

The video discusses various methods of data selection in a pandas dataframe, focusing on filtering based on value-based conditions using Boolean masking. Boolean masking involves overlaying a Boolean grid onto a dataframe, selecting values aligned with True values. An example with a “planets” dataframe illustrates creating a boolean mask for planets with fewer than 20 moons. The tutorial demonstrates applying the mask and introduces logical operators for multiple conditions (and, or, not). The importance of using parentheses for logical statements is emphasized. The video concludes by highlighting the flexibility of pandas for data selection and encourages practice to master these techniques.

This tutorial provides a comprehensive overview of Boolean masking, a powerful technique for filtering data based on specific conditions in Python. It delves into the usage of Boolean masking with various data structures, including lists, NumPy arrays, and pandas DataFrames, offering practical examples and explanations to enhance your understanding.

1. Introduction to Boolean Masking:

Boolean masking leverages the concept of Boolean arrays in Python. These one-dimensional arrays contain True and False values corresponding to elements in the data structure being filtered. By manipulating these values, you can selectively choose the desired data points.

2. Filtering Lists with Boolean Masking:

Let’s consider a list of numbers:

Python

numbers = [1, 4, 5, 2, 7, 3]

To filter and keep only the numbers greater than 4:

Python

# Create a mask with True for numbers > 4
mask = [number > 4 for number in numbers]

# Filter the list using the mask
filtered_numbers = [number for number, is_greater in zip(numbers, mask) if is_greater]

print(filtered_numbers)  # Output: [5, 7]

This code creates a mask with True for numbers greater than 4 and uses it to filter the original list, resulting in a new list containing only those elements.

3. Filtering NumPy Arrays with Boolean Masking:

Similarly, boolean masking can be applied to NumPy arrays:

Python

import numpy as np

data = np.array([1, 4, 5, 2, 7, 3])

# Create a mask with True for even numbers
mask = data % 2 == 0

# Filter the array using the mask
filtered_data = data[mask]

print(filtered_data)  # Output: [4 2]

This code creates a mask for even numbers and uses it to filter the array, resulting in a new array containing only even elements.

4. Filtering Pandas DataFrames with Boolean Masking:

Boolean masking offers significant benefits with pandas DataFrames, enabling sophisticated data filtering based on specific column values.

Python

import pandas as pd

data = pd.DataFrame({
    "name": ["Mary", "John", "Peter", "Alice", "David"],
    "age": [25, 30, 28, 22, 32],
    "city": ["New York", "London", "Paris", "Berlin", "Tokyo"]
})

# Create a mask for people older than 25
mask = data["age"] > 25

# Filter the DataFrame using the mask
filtered_data = data[mask]

print(filtered_data)

This code filters the DataFrame for people older than 25, demonstrating the filtering capabilities of Boolean masking with DataFrames.

5. Complex Boolean Expressions:

You can combine logical operators like and, or, and not to construct complex filtering criteria:

Python

# Filter people in New York or Paris who are younger than 28
mask = (data["city"] == "New York") | (data["city"] == "Paris") & (data["age"] < 28)

filtered_data = data[mask]

print(filtered_data)

This code illustrates how to use multiple conditions and logical operators to achieve more specific filtering within a DataFrame.

6. Benefits of Boolean Masking:

Boolean masking offers several advantages:

  • Efficient: Filtering large datasets is highly efficient with Boolean masking.
  • Flexible: It adapts to various data structures and allows for complex filtering conditions.
  • Readable: Code using Boolean masking is often clear and easy to understand.

7. Conclusion:

Boolean masking is a powerful and versatile technique for filtering data in Python. This tutorial provides a solid foundation for implementing Boolean masking with different data structures. Remember to practice and experiment to master this valuable tool for effective data analysis.
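
The video that follows walks through a "planets" example; here is a minimal sketch of that workflow. The planet names, radii, and moon counts are assumed for illustration:

Python

import pandas as pd

planets = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [6371, 3390, 69911, 58232, 25362, 24622],
    'moons': [1, 2, 80, 83, 27, 14]
})

# Boolean mask: True where a planet has fewer than 20 moons
mask = planets['moons'] < 20
print(planets[mask])  # Keeps Earth, Mars, and Neptune

# Multiple conditions: fewer than 10 moons OR more than 50 moons
print(planets[(planets['moons'] < 10) | (planets['moons'] > 50)])

# More than 20 moons, but not 80 moons and not a radius under 50,000 km
print(planets[(planets['moons'] > 20) &
              ~(planets['moons'] == 80) &
              ~(planets['radius_km'] < 50000)])  # Leaves just Saturn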

Previously, we investigated
several different ways of selecting data in a dataframe, including column selection, row selection, and selection of combinations
of rows and columns using name-based and
integer-based indexing. In this video, you’ll learn how to filter the data in the dataframe based on value-based conditions. You know that boolean is used to describe any binary variable with two possible values: true or false. With pandas, you can
use a powerful technique known as Boolean masking. Boolean masking is a filtering technique that overlays a Boolean
grid onto a dataframe in order to select only
the values in the dataframe that align with the
True values of the grid. Data professionals use
boolean masking all the time, so it’s important that you
understand how it works. Here’s an example. Suppose you have a dataframe
of planets, their radii, and the number of moons each planet has. Now, suppose that you
want to keep the rows of any planets that
have fewer than 20 moons and filter out the rest. A boolean mask is a pandas Series object indicating whether this
condition is true or false for each value in the “moons” column. The data contained in
this series is type bool. Boolean masking effectively
overlays this boolean series onto the dataframe’s index. The result is that any
rows in the dataframe that are indicated as
True in the Boolean mask remain in the dataframe, and any rows that are indicated
as False get filtered out. Here’s how to perform
this operation in pandas. We’ll begin by creating a dataframe from a predefined dictionary using the pandas DataFrame function. The dataframe is called “planets.” The next step is to
create the boolean mask by writing a logical statement. Remember, the objective is to keep planets that have fewer than 20 moons
and to filter out the rest. So, we define the mask by writing “planets at the moons
column is less than 20”. This results in a Series object that consists of the row indices where each index contains
a True or False value depending on whether that row
satisfies the given condition. This is the boolean mask. To apply this mask to the dataframe, insert it into selector brackets and apply it to the dataframe. It’s also possible to
apply the conditional logic directly to the dataframe,
skipping the part where we assign it to a
variable named “mask.” But breaking out the steps individually can make the code easier to follow. Note that we haven’t permanently
modified the dataframe. Applying the boolean mask
using the conditional logic only gives a filtered “view” of it. So, when you call the
planet’s variable again it returns the full dataframe. However, we can assign the
result to a named variable. This may be useful if you’ll need to reference the list of planets with moons under 20 again later. Sometimes you’ll need to filter data based on multiple conditions. Pandas uses logical operators to indicate which data to keep and which to filter out in statements that use
multiple conditions. These operators are:
The ampersand for “and” the vertical bar for “or”
and the tilde for “not”. Let’s review how this works. Here’s how to create a boolean mask that selects all planets that have fewer than 10 moons
or greater than 50 moons. Notice that each condition
is self-contained in a set of parentheses, and the two conditions are
separated by a vertical bar, which is the logical operator
that represents “or”. It’s very important that each component of the logical statement
be in parentheses. Otherwise, your statement
will throw an error, or worse, return something that
isn’t what you intended. To apply the mask, call the
dataframe and put the statement or the variable it’s assigned
to in selector brackets. Here’s an example of how
to select all planets that have more than 20 moons, but not planets with 80 moons and not planets with a radius
less than 50,000 kilometers. Let’s break it into pieces. First is a statement for planets. In parentheses, we put
“planets at the moons column must be greater than 20”
and close the parenthesis. Then we use the “and” “not”
operators before parentheses “planets at the moons column
equals 80,” close parentheses. Then we again use the
“and” “not” operators before parentheses “planets at the radius
column less than 50,000,” close parentheses. When we apply the mask to the dataframe, we’re left with just one planet: Saturn, with 83 moons and a
radius of 58,232 kilometers. There are a near infinite number of ways to select and filter data
using the basic tools that you’ve learned so far. As always, it takes a lot of practice before you know exactly how to execute every selection statement
that your work requires. So make sure to bookmark
any and all resources that you find helpful to reference. Keep up the good work and I’ll
meet you in the next video.
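To make the operations from this video concrete, here is a minimal sketch in code. The planets dataframe below follows the example described above; aside from Saturn’s figures, which the video states, the radius and moon values (and the column names) are illustrative assumptions.

Python

import pandas as pd

# Illustrative planets data; Saturn's values match the video, the rest are assumed
planets = pd.DataFrame({
    "planet": ["Mercury", "Earth", "Jupiter", "Saturn", "Uranus", "Neptune"],
    "radius_km": [2440, 6371, 69911, 58232, 25362, 24622],
    "moons": [0, 1, 80, 83, 27, 14],
})

# Boolean mask: True where a planet has fewer than 20 moons
mask = planets["moons"] < 20

# Apply the mask in selector brackets; this returns a filtered view,
# leaving the original dataframe unchanged
print(planets[mask])

# Multiple conditions: more than 20 moons, and not 80 moons,
# and not a radius under 50,000 km. Each condition gets its own parentheses.
mask = (planets["moons"] > 20) & ~(planets["moons"] == 80) & ~(planets["radius_km"] < 50000)
print(planets[mask])  # Only Saturn remains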

Reading: Boolean masking in pandas


Video: Grouping and aggregation

This video dives into grouping and aggregating data in pandas using the groupby and agg methods.

Key takeaways:

  • groupby groups rows based on values in one or more columns, allowing further analysis.
  • Applying methods like sum, mean, or count to groupby objects aggregates data within each group.
  • Multiple columns can be used for grouping and aggregation.
  • agg allows applying multiple calculations to groups simultaneously.
  • Custom functions can be defined and used within agg for specific calculations.

Benefits of grouping and aggregating:

  • Gain deeper insights into data patterns and relationships.
  • Prepare data for visualization and analysis.
  • Summarize large datasets efficiently.

Real-world applications:

  • Analyzing financial data by customer segments
  • Understanding website traffic by device type
  • Exploring scientific data by experiment parameters

Overall:

groupby and agg are powerful tools for data analysis, enabling you to uncover hidden trends and make informed decisions based on your data.

Remember:

  • Start with small examples to understand the mechanics.
  • Explore and experiment to discover the full potential of these methods.

Pandas, the Python library for data analysis, empowers you to unlock hidden stories within your data. One of its most powerful tools is the ability to group and aggregate data, allowing you to see the big picture and understand trends across different categories. This tutorial will equip you with the knowledge and skills to master this essential skill.

1. Understanding GroupBy:

At its core, groupby groups rows in your DataFrame based on shared values in one or more columns. Imagine sorting your data like books on a shelf, with each shelf representing a group. This allows you to analyze and compare data within each group, revealing patterns and relationships not readily visible in the raw data.

2. Putting GroupBy into Action:

Let’s say you have a DataFrame of customer purchases, with columns like product, price, and customer ID. Here’s how you can use groupby:

Python

import pandas as pd

# Illustrative purchase data with the columns described above
df = pd.DataFrame({
    "product": ["apple", "banana", "apple", "banana"],
    "price": [1.0, 0.5, 1.2, 0.4],
    "customer_id": [101, 102, 101, 103],
})

# Group by product
grouped_by_product = df.groupby("product")

# Calculate total sales for each product
total_sales = grouped_by_product["price"].sum()

# Print the result
print(total_sales)

This code groups the DataFrame by the “product” column and then calculates the total sales for each product using the sum function.

3. Exploring Aggregation Functions:

groupby isn’t just about sums. Pandas provides a plethora of aggregation functions like mean, median, max, min, and count to analyze your data groups in various ways. You can even use custom functions for specific calculations.
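For instance, reusing the small df from the example above, you can pass a list of method names to agg to compute several of these at once (a sketch, not the only way to do it):

Python

# Several built-in aggregations applied to each product group at once
print(grouped_by_product["price"].agg(["mean", "median", "max", "min", "count"]))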

4. Leveling Up with Multiple Columns:

For even deeper analysis, you can group by multiple columns. For example, you can group by both product and customer ID to understand individual customer preferences for different products.
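A minimal sketch of this, again reusing the hypothetical df from above:

Python

# Group by product and customer ID to see each customer's spend per product
print(df.groupby(["product", "customer_id"])["price"].sum())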

5. Aggregating Multiple Functions:

agg comes in handy when you want to apply multiple functions to each group. Imagine calculating both the average price and the total sales for each product. You can achieve this with:

Python

print(grouped_by_product.agg({"price": ["mean", "sum"]}))

This code returns a DataFrame with one row per product and two columns: the average price and the total sales.

6. Real-World Applications:

Grouping and aggregation are used in various domains:

  • Finance: Analyze spending patterns across customer segments.
  • Marketing: Understand website traffic by device type and campaign.
  • Science: Compare experimental results across different conditions.

7. Beyond the Basics:

This tutorial provides a foundational understanding. As you progress, explore advanced concepts like:

  • Hierarchical grouping for nested analysis.
  • Chaining operations for efficient data manipulation.
  • Applying custom functions for unique insights.

Remember: Practice is key! Experiment with your data and discover the powerful stories hidden within groups. With consistent practice, you’ll become a master of grouping and aggregation, unlocking the full potential of your data analysis in Pandas.

Start your journey today and unlock the secrets of your data!

Now that you’ve learned how to select and filter data using name- and location-based indexing, as well as boolean masking, it’s time to take the next step. In this video, you’ll learn how to group
your data, aggregate it, and perform calculations
on these groupings to help you discover what
the data is telling you. One of the most important and commonly used tools to group data in pandas is the groupby method. Groupby is a pandas DataFrame method that groups rows of the dataframe together based on their values
at one or more columns, which allows further
analysis of the groups. Let’s return to our planets dataset to demonstrate some different ways to use the groupby method. This time we’ll add a little more data, including the type of planet, whether or not it has a ring system, average temperature in degrees Celsius, and whether or not it has
a global magnetic field. As always, when learning a new data tool, it’s helpful to begin by applying it to a small example. This will better enable you to understand exactly what’s happening. First, let’s examine the mechanics of what happens when you use groupby. When you call the groupby
method on a dataframe, it creates a groupby object. If you do nothing else, the groupby object isn’t very helpful. You’ll basically get a statement saying, “Here’s your object. “It’s stored at this address
in the computer’s memory.” But once you have this object, there are all kinds of
things you can do with it. For example, if we group the dataframe by the “type” column and then apply the sum method to the groupby object, the computer returns a dataframe object with three rows, one for each planet type, and three columns, one for each numerical column. Only the numerical columns are returned because the sum method only works on numerical data. The “type” column is an
index of this dataframe. This information can be interpreted as the sum of all the values in each group at these respective columns. So, for example, radii of all the gas giant planets sum to 128,143 kilometers. That information probably isn’t very useful in most cases, but the total number of moons could definitely be something
we want to calculate. If you want to isolate the information at particular columns, just insert the columns as a list in selector brackets following
the groupby statement. You can also use other
methods instead of sum. For example, min, max,
mean, median, count, and many others. Groupby will work on multiple columns too. When we pass a list containing the type and magnetic field columns to the groupby method and then apply the mean
method to the result, we get a dataframe that contains a row for each unique combination of planet type and magnetic field. Again, the columns contain the mean calculated values for each
numerical column for each group. Groupby is very useful because it helps you to better
understand your data. It’s also useful to organize data that you want to plot on a graph. You’ll learn more about this later. Another important method to use on the groupby objects is the agg method. Agg is short for “aggregate.” This method allows you to apply multiple calculations to groups of data. Let’s start simple. What if we want to group the planets by their type, and then calculate both the mean and the median values of the numeric columns for each group? We call the agg method
after the groupby statement. In its argument field, we enter a list of the calculations we want to apply to the data. If these calculations are existing methods of groupby objects, they can just be entered as strings. We can group by multiple columns and apply multiple aggregation functions to each group. For instance, we can group the planets by type and whether or not they
have a magnetic field, and then use the agg method to calculate the mean and max values of each group. And we can even define our own functions and apply them. For example, suppose we want to calculate the 90th percentile of each group. We can define a function called “percentile 90” that
uses the quantile method on the array and returns the value
at the 90th percentile. Then we can call this custom
function in our aggregation. Notice that we can enter “mean” as a string because
it’s an existing method of groupby objects, but we type the “percentile 90” function as an object because it’s custom-defined. Groupby and aggregate are two tools that together can give deep insight into the story that your
data is telling you. The types of calculations that we just reviewed are daily tasks of data professionals
in nearly every field. Even though we only applied them to a very tiny dataset, these same exact operations would work on a dataset of every
planet in the galaxy, if we knew them all and if we had enough computing power to perform the aggregations! There’s a lot more we can do with groupby and aggregate, and as always I encourage you to explore more on your own, but you should now have
a solid understanding of how and when to apply these tools.
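Here is a minimal sketch of the custom-aggregation pattern described in this video. The planet values and column names are illustrative assumptions; the percentile_90 function follows the approach the video describes, using the quantile method.

Python

import pandas as pd

# Illustrative data; the lesson uses a fuller planets table
planets = pd.DataFrame({
    "type": ["terrestrial", "terrestrial", "gas giant", "gas giant"],
    "moons": [0, 1, 80, 83],
    "radius_km": [2440, 6371, 69911, 58232],
})

# Custom aggregation: the value at the 90th percentile of each group
def percentile_90(x):
    return x.quantile(0.9)

# "mean" is a string because it's an existing method of groupby objects;
# percentile_90 is passed as an object because it's custom-defined
print(planets.groupby("type").agg(["mean", percentile_90]))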

Reading: More on grouping and aggregation


Video: Merging and joining data

Concatenation (concat):

  • Combines dataframes horizontally (new columns) or vertically (new rows).
  • Useful for adding additional data that shares the same format as the existing dataframe.
  • Uses “axis” keyword to specify vertical (0) or horizontal (1) concatenation.
  • Example: Combining dataframes with planet information (radius, moons) vertically.

Merge:

  • Joins two dataframes based on a shared key (e.g., planet name).
  • Different types of joins:
    • Inner: Only keeps rows with keys present in both dataframes.
    • Outer: Includes all keys from both dataframes, filling missing values with NaN.
    • Left: Includes all keys from the left dataframe and matching keys from the right.
    • Right: Includes all keys from the right dataframe and matching keys from the left.
  • Example: Combining dataframes with planet information and additional details (type, rings, temperature) using various join types.

Key Takeaways:

  • Choose concat for adding data vertically with the same format.
  • Use merge for joining data based on shared keys, specifying the desired join type.
  • Understanding these tools is crucial for data analysis and manipulation in pandas.

Pandas, the data analysis workhorse, excels at wrangling and manipulating data. One of its most powerful features is the ability to merge and join dataframes, allowing you to combine information from multiple sources into a single, unified dataset. This tutorial will equip you with the essential knowledge to conquer these tasks with confidence.

Understanding the Basics:

Merging and joining are related concepts, but they have subtle differences:

  • Merging: Combines dataframes based on a shared key (e.g., customer ID).
  • Joining: Combines dataframes based on a relationship between them (e.g., orders and products).

Both operations result in a single dataframe containing information from the combined sources.

Choosing the Right Tool:

  • Merge: Use a merge when you need to match and combine data based on a specific key. For example, merging customer data with order data based on customer ID.
  • Join: Use a join when you need to combine data based on a broader relationship. For example, joining product information with a sales dataframe based on product ID.

Merging with pd.merge:

pd.merge is the workhorse function for merging dataframes. It takes several arguments:

  • left: The left dataframe (the one considered “primary”).
  • right: The right dataframe (the one to be merged).
  • on: The column(s) used as the key for matching data.
  • how: The type of join (inner, outer, left, right).
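Here is a minimal sketch of these arguments in use; the customers and orders dataframes are hypothetical:

Python

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Chen"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "total": [25.0, 10.0, 42.0]})

# Inner join on the shared key: only customers 1 and 3 have orders
merged = pd.merge(left=customers, right=orders, on="customer_id", how="inner")
print(merged)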

Joining with pd.concat:

pd.concat is typically used for joining dataframes vertically (stacking them) rather than horizontally (merging them). However, you can achieve specific joins using:

  • axis=1: Joins dataframes horizontally (useful for related data with different keys).
  • join: Similar to how in pd.merge, specifies the type of join (inner, outer).
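A small sketch of both behaviors, using two hypothetical dataframes:

Python

import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
df_b = pd.DataFrame({"y": [3, 4]}, index=["r2", "r3"])

# Stack vertically (axis=0 is the default); unmatched columns fill with NaN
print(pd.concat([df_a, df_b]))

# Combine horizontally, keeping only index labels present in both
print(pd.concat([df_a, df_b], axis=1, join="inner"))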

Exploring Join Types:

  • Inner join: Keeps rows only where keys exist in both dataframes.
  • Outer join: Keeps all rows from both dataframes, filling missing values with NaN.
  • Left join: Keeps all rows from the left dataframe and matching rows from the right.
  • Right join: Keeps all rows from the right dataframe and matching rows from the left.
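Continuing the hypothetical customers/orders example from above, an outer join keeps every key and fills the unmatched side with NaN:

Python

# Customer 2 placed no orders, so the "total" column is NaN for that row
print(pd.merge(customers, orders, on="customer_id", how="outer"))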

Real-World Examples:

  • Merging customer data with purchase history.
  • Joining product information with customer reviews.
  • Combining financial data from multiple sources.

Tips and Tricks:

  • Use descriptive column names to avoid confusion.
  • Check for duplicate rows after merging.
  • Handle missing values appropriately.
  • Explore advanced merge options like suffixes for differentiating columns.

Practice Makes Perfect:

The best way to master these techniques is through practice. Start with small datasets and experiment with different types of joins and merge options. Don’t be afraid to get creative and explore the power of combining data to unlock valuable insights!

Remember, merging and joining data in Pandas is a powerful tool that can unlock the hidden potential within your data. By understanding the basics, experimenting, and practicing, you’ll be well on your way to becoming a data wrangling champion!

Hello again! You’ve learned a lot about pandas, that it’s a powerful library that makes working with tabular data easier and more efficient, how to select and index
data in a dataframe, how to filter data using Boolean masks, and how to group and aggregate data to derive insights from
the story it’s telling. In this video, you’ll learn how to add new data to existing dataframes. This is a common task
for data professionals, but it’s not as simple as just adding two dataframes together. There are important
considerations to be aware of. By the end of this lesson, you’ll have a good understanding of what these considerations are so you can make informed decisions about how best to add
data to your project. We’re going to learn about
two pandas functions: concat and merge. There’s considerable overlap between the capabilities of these functions, but it’s most important that you learn the basics of each because you will encounter them
regularly as a data professional. We’ll start with the concat function. Recall that to concatenate means to link or join together. The pandas concat function combines data either by adding it horizontally as new columns for existing rows, or vertically as new rows
for existing columns. It’s also capable of handling many data-specific complexities that arise, which allows for a high
degree of user control. In this video, I’ll demonstrate how to
use the concat function to add new rows to existing columns, but remember, there’s plenty
of support documentation if you’d like additional information. Pandas has a specific way to indicate which way we want the
data to be concatenated. We do this by referring to axes. In fact, many pandas and NumPy functions have an “axis” keyword so you can specify whether you want to apply the function
across rows or down columns. The two axes of a dataframe are zero, which runs vertically over rows; and one, which runs
horizontally across columns. We’ll use our basic planets dataset to demonstrate how concat works. This data has four planets, their radii, and their number of moons, but it’s missing the data for Jupiter, Saturn, Uranus, and Neptune. Now suppose we want to add this data, which exists as a separate dataframe. Let’s examine this second dataset with information about
Jupiter, Saturn, Uranus, and Neptune before joining them. Notice that this data
is in the same format as the data in the df1 dataframe. It has the same columns for
planet, radius, and moons. To combine the two dataframes, we’ll want to add df2
as new rows below df1. To concatenate the first dataset with information about
Mercury, Venus, Earth, and Mars with the second, which has information about Jupiter, Saturn, Uranus, and Neptune, we call PD concat and insert a list of the dataframes we want to concatenate. Then we need to include
an axis keyword argument. This instructs the function to combine the data either side-by-side or one on top of the other. We want our resulting dataframe to have eight rows and three columns, which means we want to combine the data vertically. In other words, we want to add new data
by extending axis zero, the vertical axis. Perfect! The data was added as new rows. Notice that each row retains its index number from
its original dataframe. If you want the numbering to restart, just reset the index. We include the “drop equals true” argument because otherwise a new index column will be added to the dataframe, which we don’t want in this case. Now the enumeration of the row indices goes from zero to seven. The concat function is great for when you have dataframes containing identically formatted data that simply needs to be combined vertically. If you want to add data horizontally, consider the merge function. The merge function is a pandas function that joins two dataframes together. It only combines data by extending along axis one horizontally. Let’s return to the planets. Now we have the radius and number of moons for all eight planets, but suppose we want to add the data for the planet type, whether it has rings, its average temperature, whether it has a magnetic field, and whether it has life on it. Perhaps this data exists
as a separate dataframe, but it’s missing Mercury and Venus and it has some recently
discovered planets from other star systems,
Janssen and Tadmor. That’s okay. We can still work with this. First, let’s conceptualize how data joins work. For two datasets to connect, they need to share a
common point of reference. In other words, both datasets must have
some aspect of them that is the same in each one. These are known as keys. Keys are the shared points of reference between different dataframes, what to match on. In our case, the keys are the planets. Each dataframe contains
planets for us to match on. Now let’s consider the different ways that we can join this data. We can join it so only the keys that are in both dataframes get included in the merge. This is known as an inner join. Alternatively, we can join the data so all of the keys from both dataframes get included in the merge. This is known as an outer join. We can also join the
data so all of the keys in the left dataframe are included, even if they aren’t in
the right dataframe. This is called a left join. Finally, we can join
the data so all the keys in the right dataframe are included, even if they aren’t in
the left dataframe. This is called a right join. Let’s examine how each type of join affects our planet data. First we’ll call the function and enter df3 and df4 as the left and right positional
arguments, respectively. Then we include the keyword argument “on,” which lets us specify what our keys to match on should be. In this case, we want to
use the “planet” column. Now we have the “how” keyword argument. This is where we enter
the kind of join we want. Let’s try “inner” first. This merged the data and only kept the planets that appeared in both dataframes. This means we’re missing data for Mercury and Venus from the left dataframe as well as for Janssen and Tadmor from the right dataframe. Now let’s try an outer join. Our function call will remain the same except for the
“how” keyword argument, which we’ll set to “outer.” As expected, this results in a dataframe that contains all the keys from both initial dataframes. Notice that, because Janssen and Tadmor aren’t in the left dataframe, they don’t have information
for radius and moons, so these columns get filled in with NaNs. Similarly, because Mercury and Venus aren’t in the right dataframe, they too are missing some information in the final table, which is represented by NaNs. Next we’ll do a left join. Again, the function gets the same syntax except for the “how” argument, which is set to “left.” This results in a dataframe that retains all the keys from the left dataframe and only the keys from
the right dataframe that exist in the left dataframe too. So Janssen and Tadmor are excluded. Finally, we’ll perform a right join. As expected, the result is a dataframe that has all the keys
from the right dataframe, but none of the keys from the left that weren’t also in the right. So Mercury and Venus are excluded. Nice job! Now that you know the fundamentals, you can use these pandas tools to do the most common kinds of data joins, which will be useful for a wide variety of data projects. And as you advance in your career, you’ll discover even
more about joining data, and how it can get very complex. These tools will be a big help as you do. You’ve come a long way and are now ready to start using pandas to explore your data like
a true data professional. See you soon!
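As a recap of this video’s workflow, here is a minimal sketch. The dataframe names (df1 through df4) and the planet, radius, and moons columns follow the video; most of the specific values, and the planet types for Janssen and Tadmor, are illustrative assumptions.

Python

import pandas as pd

df1 = pd.DataFrame({"planet": ["Mercury", "Venus", "Earth", "Mars"],
                    "radius_km": [2440, 6052, 6371, 3390],
                    "moons": [0, 0, 1, 2]})
df2 = pd.DataFrame({"planet": ["Jupiter", "Saturn", "Uranus", "Neptune"],
                    "radius_km": [69911, 58232, 25362, 24622],
                    "moons": [80, 83, 27, 14]})

# Concatenate vertically along axis 0, then renumber the row index
df3 = pd.concat([df1, df2], axis=0).reset_index(drop=True)

# A second dataframe missing Mercury and Venus but adding two exoplanets
df4 = pd.DataFrame({"planet": ["Earth", "Mars", "Jupiter", "Saturn",
                               "Uranus", "Neptune", "Janssen", "Tadmor"],
                    "type": ["terrestrial", "terrestrial", "gas giant",
                             "gas giant", "ice giant", "ice giant",
                             "super-earth", "gas giant"]})

# Compare the four kinds of joins on the shared "planet" key
for how in ["inner", "outer", "left", "right"]:
    print(how)
    print(pd.merge(df3, df4, on="planet", how=how))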

Lab: Exemplar: Dataframes with pandas

Practice Quiz: Test your knowledge: Dataframes with pandas

Fill in the blank: In pandas, a _____ is a one-dimensional, labeled array.

In pandas, what is Boolean masking used for?

What is a pandas method that groups rows of a dataframe together based on their values at one or more columns?

A data professional wants to join two dataframes together. The dataframes contain identically formatted data that needs to be combined vertically. What pandas function can the data professional use to join the dataframes?

Review: Data structures in Python


Video: Wrap-up

Major Learnings:

  • Data Structures: Storing, accessing, and organizing data effectively using lists, tuples, dictionaries, sets, and arrays.
  • NumPy: Utilizing its computational power for rapid data processing and calculations.
  • Pandas: Analyzing tabular data through tasks like filtering, grouping, and merging.

Significance:

  • Understanding data structures is crucial for efficient data analysis.
  • NumPy and pandas are essential tools for data professionals.
  • Pandas will be your companion throughout the program and future career.

Next Steps:

  • Prepare for the graded assessment by reviewing new terms, videos, and readings.
  • Feel free to revisit any resource to solidify your understanding.

Overall Message:

Congratulations on mastering this section! You’ve built a strong foundation in Python that will empower you to succeed as a data professional. Keep up the fantastic progress!

This is the end of the fourth section of the Python course. You now have a strong
foundation of Python skills that you can continue to build on throughout your future career
as a data professional. In this section of the course, you learned how data
professionals use data structures to store, access, and organize their data. Understanding which data structures fit your specific task is
a key part of data work, and will help you analyze your data with speed and efficiency. We’ve reviewed fundamental data structures that are super useful
for data professionals: lifts, tuples, dictionaries,
sets, and arrays. We also discussed two
of the most widely-used and important Python tools
for advanced data analysis. The first was NumPy, which data professionals use
for its computational power. You learned how NumPy can
help you rapidly process large amounts of data and
perform useful calculations. The second Python tool you
learned about was pandas, which is a powerful tool
for analyzing tabular data. You learned how pandas can
help you perform key tasks such as filtering,
grouping, and merging data. Data professionals often
work with tabular data. You’ll use pandas throughout the rest of the certificate program…
and your future career. Coming up, you have a graded assessment. To prepare, review the reading that lists all the new terms you’ve learned. And feel free to revisit videos, readings, and other resources
that cover key concepts. Congratulations on all your progress. Way to go!

Reading: Glossary terms from module 4

Terms and definitions from Course 2, Module 4

Quiz: Module 4 challenge

Fill in the blank: In Python, _____ indicate where a list starts and ends.

A data professional is working with a list named cities that contains data on global cities. What Python code can they use to add the string ‘Tokyo’ to the end of the list?

Which of the following statements accurately describe Python tuples? Select all that apply.

Which of the following statements accurately describe Python dictionaries? Select all that apply.

A data professional is working with a dictionary named employees that contains employee data for a healthcare company. What Python code can they use to retrieve only the dictionary’s values?

A data professional is working with two Python sets. What function can they use to find the elements present in one set, but not the other?

Fill in the blank: In Python, _____ typically contain a collection of functions and global variables.

Fill in the blank: A _____ NumPy array can be created from a list of lists, where each internal list is the same length.

A data professional is working with a pandas dataframe named sales that contains sales data for a retail website. They want to know the price of the least expensive item. What code can they use to calculate the minimum value of the Price column?

A data professional is working with a pandas dataframe. They want to select a subset of rows and columns by index. What method can they use to do so?

A data professional wants to merge two pandas dataframes. They want to join the data so all of the keys in the left dataframe are included—even if they are not in the right dataframe. What technique can they use to do so?