Now, you’ll explore fundamental data structures such as lists, tuples, dictionaries, sets, and arrays. Lastly, you’ll learn about two of the most widely used and important Python tools for advanced data analysis: NumPy and pandas.
Learning Objectives
- Explain how to manipulate dataframes using techniques such as selecting and indexing, boolean masking, grouping and aggregating, and merging and joining
- Describe the main features and methods of core pandas data structures such as dataframes and series
- Describe the main features and methods of core NumPy data structures such as arrays
- Define Python tools such as libraries, packages, modules, and global variables
- Describe the main features and methods of built-in Python data structures such as lists, tuples, dictionaries, and sets
- Lists and tuples
- Welcome to module 4
- Lab: Annotated follow-along guide: Data structures in Python
- Video: Introduction to lists
- Video: Modify the contents of a list
- Reading: Reference guide: Lists
- Video: Introduction to tuples
- Reading: Compare lists, strings, and tuples
- Video: More with loops, lists, and tuples
- Reading: zip(), enumerate(), and list comprehension
- Lab: Exemplar: Lists & tuples
- Practice Quiz: Test your knowledge: Lists and tuples
- Dictionaries and sets
- Video: Introduction to dictionaries
- Video: Dictionary methods
- Reading: Reference guide: Dictionaries
- Video: Introduction to sets
- Reading: Reference guide: Sets
- Lab: Exemplar: Dictionaries & sets
- Practice Quiz: Test your knowledge: Dictionaries and sets
- Arrays and vectors with NumPy
- Video: The power of packages
- Video: Introduction to NumPy
- Creating NumPy arrays
- Working with NumPy arrays
- Conclusion
- Reading: Understand Python libraries, packages, and modules
- Video: Basic array operations
- Reading: Reference guide: Arrays
- Lab: Exemplar: Arrays and vectors with NumPy
- Practice Quiz: Test your knowledge: Arrays and vectors with NumPy
- Dataframes with pandas
- Video: Introduction to pandas
- Video: pandas basics
- Reading: The fundamentals of pandas
- Video: Boolean masking
- Reading: Boolean masking in pandas
- Video: Grouping and aggregation
- Reading: More on grouping and aggregation
- Video: Merging and joining data
- Lab: Exemplar: Dataframes with pandas
- Review: Data structures in Python
- Video: Wrap-up
- Reading: Glossary terms from module 4
- Quiz: Module 4 challenge
Lists and tuples
Welcome to module 4
This video highlights the learner’s progress in their Python learning journey, emphasizing the ability to use variables, data types, functions, operators, conditional statements, and loops. It introduces the concept of data structures and their importance in data analysis, and discusses two crucial libraries for data professionals: NumPy and pandas. It concludes by inviting the learner to continue their learning journey in the next video.
Hello again! You’ve come so far on your learning journey! Just think of all the new Python skills you’ve developed along the way. You’ve learned how to use variables to store and label your data, and how to work with different data types, such as integers, floats, and strings. You can call functions to perform actions on your data, and use operators to compare values. You also know how to write clean code that can be easily understood and reused by other data professionals. You can write conditional statements to tell the computer how to make decisions based on your instructions. And recently, you learned how to use loops to automate repetitive tasks.

Coming up, we’ll explore data structures, which are collections of data values or objects that can contain different data types. Data professionals use data structures to store, access, organize, and categorize their data with speed and efficiency. Knowing which data structures fit your specific task is a key part of data work, and will help streamline your analysis. We’ll focus on data structures that are among the most useful for data professionals: lists, tuples, dictionaries, sets, and arrays.

Part of what makes Python such a powerful and versatile programming language is the ecosystem of libraries and packages available to it. After we review fundamental data structures, we’ll discuss two of the most important libraries and packages for data professionals. The first is Numerical Python, or NumPy, which is known for its high-performance computational power. Data professionals use NumPy to rapidly process large quantities of data. I often use NumPy in my job because it’s so useful for analyzing large and complex datasets. The second is the Python Data Analysis Library, or pandas, which is a key tool for advanced data analytics. Pandas makes analyzing data in the form of a table with rows and columns easier and more efficient, because it has tools specifically designed for the job. When you’re ready, I’ll meet you in the next video!
Lab: Annotated follow-along guide: Data structures in Python
Video: Introduction to lists
In this video, the differences between data types and data structures are discussed, with a focus on the list as a specific kind of data structure in Python. Data types are attributes describing data based on values, programming language, or operations it can perform. Data structures are collections of data values, and lists, in particular, help store and manipulate ordered collections of items.
Lists, like strings, are sequences that allow duplicate elements, indexing, and slicing. Lists, however, are mutable, meaning their elements can be modified, added, or removed, whereas strings are immutable. The tutorial explains how to access elements in a list using indices and perform slicing to create subsets of the list.
The concept of mutability and immutability is highlighted, emphasizing that lists are mutable, allowing changes to their internal state. Practical examples are provided, such as checking if a list contains a specific element using the “in” keyword to generate a Boolean statement.
The tutorial underscores the usefulness of lists for organizing and categorizing related data, performing operations on multiple values simultaneously, and simplifying code. The audience is encouraged to stay tuned for more insights into working with lists.
Introduction to Lists in Python
Lists are a fundamental data structure in Python, used to store collections of items. They are versatile and can hold various data types, including integers, floats, strings, and even other lists. Lists are mutable, meaning their contents can be modified after creation.
Creating Lists
Creating a list in Python is straightforward. Use square brackets ([]) and enclose the items, separated by commas. For example:
Python
my_list = [1, 2, 3, 4, 5]
print(my_list)
This code creates a list named my_list containing the numbers 1 to 5. The print statement displays the list’s contents.
Accessing List Elements
Elements in a list are accessed using their index, which starts from 0. Positive indexes refer to elements from the beginning of the list, while negative indexes start from the end. For example:
Python
print(my_list[0]) # Output: 1
print(my_list[-1]) # Output: 5
Modifying Lists
Lists are mutable, allowing you to change their contents after creation. You can add, remove, or modify elements using various methods. For example:
Python
my_list.append(6) # Adds 6 to the end of the list
my_list.remove(2) # Removes the element with value 2
my_list[1] = 7 # Replaces the second element with value 7
List Operations
Python provides various operations for manipulating lists, such as:
- len(my_list): Returns the length of the list
- my_list.sort(): Sorts the list in ascending order (in place)
- my_list.reverse(): Reverses the order of elements in the list (in place)
- my_list + another_list: Concatenates two lists
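Taken together, these operations can be sketched in a short, self-contained example (the variable names are illustrative):

```python
my_list = [3, 1, 2]
another_list = [4, 5]

print(len(my_list))                # number of elements: 3
my_list.sort()                     # sorts in place: [1, 2, 3]
my_list.reverse()                  # reverses in place: [3, 2, 1]
combined = my_list + another_list  # builds a new list: [3, 2, 1, 4, 5]
print(combined)
```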
List Slicing
List slicing allows you to extract a sublist from a list. Slicing uses the colon (:) to separate the start index (optional), end index (optional), and step (optional). For example:
Python
sublist = my_list[1:4]  # Extracts elements from index 1 (inclusive) to 4 (exclusive)
print(sublist)          # Output: [2, 3, 4] (assuming my_list = [1, 2, 3, 4, 5])
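Slicing also allows omitted indices and a step value; a brief sketch:

```python
my_list = [1, 2, 3, 4, 5]

print(my_list[:3])    # start omitted, defaults to 0: [1, 2, 3]
print(my_list[2:])    # end omitted, defaults to the length: [3, 4, 5]
print(my_list[::2])   # every second element: [1, 3, 5]
print(my_list[::-1])  # a step of -1 reverses the list: [5, 4, 3, 2, 1]
```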
List Comprehensions
List comprehensions provide a concise way to create lists based on an expression. They combine a for loop, and optionally conditional statements, within square brackets. For example:
Python
squares = [x * x for x in range(1, 6)] # Creates a list of squares from 1 to 5
print(squares) # Output: [1, 4, 9, 16, 25]
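List comprehensions can also include a condition to filter elements; for example:

```python
even_squares = [x * x for x in range(1, 11) if x % 2 == 0]  # keep only even x
print(even_squares)  # Output: [4, 16, 36, 64, 100]
```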
Conclusion
Lists are a powerful and versatile data structure in Python, essential for organizing and managing collections of data. Their flexibility and ease of use make them a valuable tool for various programming tasks.
Great to be with you again. In this video, we’ll discuss the differences between data types and data structures. Then, we’ll explore lists, which are a specific kind of data structure.

As you’ve learned, a data type is an attribute that describes a piece of data based on its values, its programming language, or the operations it can perform. In the context of Python, this includes classes such as integers, strings, floats, and Boolean expressions, among others. A data structure is a collection of data values, or objects, that can contain different data types. So, data structures can contain data type elements, such as a float or a string. Data structures also enable more efficient storage, access, and modification. They allow you to organize and categorize your data, relate collections of data to each other, and perform operations accordingly.

One data structure in Python is a list. A list is a data structure that helps store, and if necessary, manipulate an ordered collection of items, such as a list of email addresses associated with a user account. A list is a lot like a string, and you can do many of the same things with lists. For example, both strings and lists allow duplicate elements, as well as indexing and slicing. Additionally, both are sequences. A sequence is a positionally ordered collection of items. However, strings are sequences of characters, while lists store sequences of elements of any data type.

There are some other key differences between lists and strings. First, note that different data structures are either mutable or immutable. Mutability refers to the ability to change the internal state of a data structure. Immutability is the reverse, where a data structure’s or element’s values can never be altered or updated. Lists and their contents are mutable, so their elements can be modified, added, or removed. But strings are immutable.

Think of a list like a long box with a space inside, divided into different slots. Each slot contains a value, and each value can store data. This could be another data structure, such as another list, or an integer, string, float, or output from another function.

When working with lists, you use an index to access each of the elements. Recall that an index provides the numbered position of each element in an ordered sequence. In this case, our sequence is a list. Let’s go through an indexing example together. First, assign the following list of words to a variable X: “Now,” “we,” “are,” “cooking,” “with,” “seven,” “ingredients.” In Python, we use square brackets to indicate where the list starts and ends, and commas to separate each element contained in it. To print an element of a list, use its index number. So, to print the word “cooking,” print the element at index three of the list variable X. This is just like focusing on a specific character or substring in a string. The index of the first element in a list, as with strings, is zero. So, if we print the element with slot number three, we get the item, or word, “cooking” from our list of seven words. Remember that indexing always starts at zero. So, if we had typed seven, to try to access the last word of our list, we’d get an IndexError.

We can also use indices to create a slice of the list. For this, use a range of two numbers, separated by a colon, to get the second and third words of our list: “we” and “are.” You can also leave the first number empty to get all words until index slot two. We get “now” and “we,” which have index slots zero and one. And to leave the second of the range indexes empty, use a number followed by a colon. This will give us the other part of the list. So, just as with string indexing, the first value defaults to zero, and the second value, if left empty, defaults to the length of the list.

To check if a list of words contains a certain element, like the word “this,” use the keyword “in” to generate a Boolean statement. This verifies whether the word exists. The result of this check is a Boolean, which we can then use as a condition for branching or looping in the rest of the code.

Lists are very useful when you’re working with many related values. They enable you to keep the right data together, simplify your code, and perform the same operations on multiple values at once. Coming up, we have even more on lists.
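The example walked through in this video can be sketched in code as follows (the variable name follows the narration):

```python
x = ["Now", "we", "are", "cooking", "with", "seven", "ingredients"]

print(x[3])         # indexing starts at 0, so index 3 holds "cooking"
print(x[1:3])       # slice of the second and third words: ['we', 'are']
print(x[:2])        # from the start up to (not including) index 2: ['Now', 'we']
print(x[2:])        # from index 2 to the end of the list
print("this" in x)  # membership check returns a Boolean: False
```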
Fill in the blank: Mutability refers to the ability to _____ the internal state of a data structure.
change
Mutability refers to the ability to change the internal state of a data structure. Immutability is the reverse, where a data structure or element’s values can never be altered or updated.
Video: Modify the contents of a list
This video continues the discussion of lists in Python, focusing on modifying their contents. The append() method adds an element to the end of a list, while the insert() method inserts an element at a specific index. The remove() method removes an element from the list, and the pop() method removes and returns the element at a specific index.
The text also discusses the difference between mutable and immutable data types. Strings are immutable, meaning that their contents cannot be changed after creation. Lists, on the other hand, are mutable, meaning that their contents can be changed after creation.
The video concludes with a snack break and a promise to continue the discussion of lists in the next video.
Essential Operations for Modifying Lists in Python
Manipulating lists is a fundamental aspect of programming in Python. Lists are mutable data structures, meaning their contents can be altered after creation. This flexibility makes them invaluable for storing and managing collections of data.
1. Adding Elements to Lists: The append() Method
The append() method is the most straightforward way to add elements to a list. It appends the specified element to the end of the list. The syntax is simple:
Python
list_name.append(element_to_add)
For instance, consider a list of fruits:
Python
fruits = ["apple", "banana", "orange"]
To add “mango” to the list, use the append() method:
Python
fruits.append("mango")
The updated list will be:
Python
fruits = ["apple", "banana", "orange", "mango"]
2. Inserting Elements into Lists: The insert() Method
The insert() method provides more precise control over element placement. It inserts the specified element at a particular index within the list. The syntax is:
Python
list_name.insert(index, element_to_insert)
Here’s an example of inserting “kiwi” at index 1:
Python
fruits.insert(1, "kiwi")
The updated list will be:
Python
fruits = ["apple", "kiwi", "banana", "orange", "mango"]
3. Removing Elements from Lists: The remove() and pop() Methods
The remove() method eliminates the first occurrence of a specified element from the list. Its syntax is:
Python
list_name.remove(element_to_remove)
For example, to remove “banana” from the list:
Python
fruits.remove("banana")
The updated list will be:
Python
fruits = ["apple", "kiwi", "orange", "mango"]
The pop() method, on the other hand, removes and returns the element at a specific index. Its syntax is:
Python
removed_element = list_name.pop(index)
For instance, to remove and store the element at index 2:
Python
removed_fruit = fruits.pop(2)
This will remove “orange” from the list and store it in the variable removed_fruit. The updated list will be:
Python
fruits = ["apple", "kiwi", "mango"]
4. Replacing Elements in Lists:
To replace an element at a specific index, use assignment:
Python
list_name[index] = new_element
For example, to replace “kiwi” with “pineapple”:
Python
fruits[1] = "pineapple"
The updated list will be:
Python
fruits = ["apple", "pineapple", "mango"]
5. Clearing Lists: The clear() Method
The clear() method removes all elements from a list, effectively emptying it. Its syntax is:
Python
list_name.clear()
Applying this to the fruits list will result in:
Python
fruits = []
Conclusion
Modifying lists in Python is essential for data manipulation and organization. The append(), insert(), remove(), pop(), and clear() methods provide powerful tools for adding, removing, and replacing elements, making lists versatile data structures.
In this video, we’ll continue with lists. You’ll learn how to modify the contents of a list. This will give you greater control over your lists because you can add, remove, and change the items that they contain. Previously, we thought about a list as a box divided into different slots. Modifying it means we keep the box, but we add, remove, or change what’s inside.

When thinking about modifying lists, there are a few methods that can be used. We’ll begin with the append method. The append method adds an element to the end of a list. This requires one argument because this function adds the incoming element to the end of the list as a single new entry. You can even start with an empty list, and all new elements will be added at the end. Let’s explore an example. We’ll begin by typing a list of fruits. Upon further inspection, it seems we forgot to add kiwi to the list. So, we can use the append method to add it. This uses one parameter; in this case, the string kiwi.

Another common method for modifying lists is insert, which requires two arguments: the index number of the element to be modified, and the contents being put in that slot, such as a string or integer. Let’s investigate how insert works. Insert is a function that takes an index as the first parameter and an element as the second parameter, then inserts the element into a list at the given index. Returning to our list of fruits, orange is now inserted at the second spot, at index one, in our fruit list. Let’s add an element at the beginning of the list. In the first parameter, we’ll put zero. Then, type mango as the second element.

To remove an element from a list, let’s consider the remove method. Remove is a method that removes an element from a list. Similar to the append method, remove only requires one parameter. Now our fruit list no longer has a banana. If we try to remove an element that is not in the list, like strawberries, for example, we get a ValueError. Another common way to remove elements is with the pop method, which uses an index. Pop extracts an element from a list by removing it at a given index. So, to remove orange, pop the third element in the list, with index number two.

Now, suppose that, after removing orange, you decide to also remove pineapple and replace it with mango. Simply reassign its value. Reference the pineapple item’s index number, one, and replace it with mango. This renders the list without orange, because we already removed it, as well as mango, which replaced pineapple. Our fruits have changed a lot since we started. But it’s always the same list, the same box. We’ve just modified what’s inside.

At this point, I want to address something that new learners of Python often wonder about. You’ll recall that strings are immutable and lists are mutable. What this means exactly might not be clear at first. After all, didn’t we have multiple videos about how to manipulate strings? A new example will help make this more clear. Whenever we modify a string, we always have to reassign the change back to the variable that contained the string. This is overwriting the existing variable with a brand new one. Notice that we can’t, say, overwrite the character at index 0. We get an error. However, we can do this with a list. That’s why lists are considered mutable.

Great work modifying lists. Now, after all that talk about tasty fruit, I think we deserve a snack break! I’ll catch up with you again very soon.
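The mutability contrast at the end of this video can be sketched directly:

```python
greeting = "hello"

# Strings are immutable: assigning to a character in place raises a TypeError.
try:
    greeting[0] = "H"
except TypeError as err:
    print("Error:", err)

# To "change" a string, reassign a brand-new string to the variable.
greeting = "H" + greeting[1:]
print(greeting)  # Hello

# Lists are mutable: the same in-place assignment works.
letters = ["h", "e", "l", "l", "o"]
letters[0] = "H"
print(letters)  # ['H', 'e', 'l', 'l', 'o']
```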
Reading: Reference guide: Lists
Video: Introduction to tuples
Tuples are immutable sequences of data that are useful for storing information that needs to be processed together in the same structure. They are more secure than lists because they cannot be changed as easily. Tuples can be instantiated using parentheses or the tuple() function. They can also be used to return values from functions. Tuples are iterable, so we can extract information from them using loops. Using tuples in data professional work can help make your processes more efficient and save your team time and effort.
What are Tuples?
Tuples are a fundamental data structure in Python. They are immutable sequences of objects, meaning that their contents cannot be changed after creation. This makes them a secure way to store data that needs to remain constant. Tuples are often used to store collections of related data, such as the name, age, and city of a person.
Creating Tuples
Tuples are created using parentheses. For example, the following code creates a tuple containing the names of three fruits:
fruits = ("apple", "banana", "orange")
You can also use the tuple() function to create a tuple from a list or other iterable:
numbers = [1, 2, 3, 4, 5]
numbers_tuple = tuple(numbers)
Accessing Elements of Tuples
Elements of a tuple are accessed using their index, which is a non-negative integer. The first element has index 0, the second element has index 1, and so on. For example, the following code accesses the second element of the fruits tuple:
second_fruit = fruits[1]
print(second_fruit) # Output: banana
Immutability of Tuples
One of the key features of tuples is that they are immutable. This means that once a tuple is created, its contents cannot be changed. For example, the following code will raise an error:
fruits[0] = "grape" # This will raise an error
Operations on Tuples
There are a few basic operations that can be performed on tuples:
- Indexing: Accessing elements of a tuple using their index.
- Slicing: Extracting a sub-sequence from a tuple.
- Concatenation: Combining two tuples into a single tuple.
- Membership: Checking whether an element is present in a tuple.
- Length: Determining the number of elements in a tuple.
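Each of these operations can be sketched briefly:

```python
point = (3, 7, 1)
other = (2, 4)

print(point[0])       # indexing: 3
print(point[1:])      # slicing: (7, 1)
print(point + other)  # concatenation: (3, 7, 1, 2, 4)
print(7 in point)     # membership: True
print(len(point))     # length: 3
```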
Comparison of Tuples
Tuples can be compared using the usual comparison operators (==, !=, <, >, <=, >=). Two tuples are considered equal if they have the same length and the corresponding elements are equal.
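Ordering comparisons proceed element by element, from left to right; a brief sketch:

```python
print((1, 2, 3) == (1, 2, 3))  # True: same length, equal elements
print((1, 2) == (1, 2, 3))     # False: different lengths
print((1, 2, 3) < (1, 3, 0))   # True: 2 < 3 decides at the second element
```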
Applications of Tuples
Tuples are often used in the following situations:
- Storing constant data: When you need to store data that should not be changed, such as the names of months or the days of the week.
- Returning multiple values from functions: Tuples are a convenient way to return multiple values from functions.
- Storing data with positional significance: When the order of the data is important, such as the coordinates of a point in space.
Conclusion
Tuples are a versatile and powerful data structure in Python. Their immutability makes them a secure way to store data, and their ability to be used in various operations makes them a valuable tool for data manipulation. By understanding the basics of tuples, you can enhance your Python programming skills and tackle a wide range of programming tasks.
As a data professional, sometimes it will be more important to access and reference data than to change and manipulate it. When you simply need to find some information, but keep the data intact, you can use a data structure called a tuple. A tuple is an immutable sequence that can contain elements of any data type. Tuples are kind of like lists, but they’re more secure because they cannot be changed as easily. They’re helpful because they keep data that needs to be processed together in the same structure.

Tuples are instantiated, or expressed, with parentheses or the tuple() function. Here we have a tuple that represents someone’s full name. Notice that it was instantiated using parentheses. The first element of the tuple is their first name, the second is the first letter of their middle name, and the third is their last name. The position of each element is fixed in tuples, so you can’t add new elements in the middle, and you can’t change any of the elements. If we try to change the last name, which lives in index number 2, from Hopper to Copper, the code will throw an error. You can add a value to the end, but only if you reassign the tuple. Another way we can create tuples is by using the tuple function to transform input into tuples. In this case, our name is represented as a list. We can convert the list to a tuple by using the tuple function. Notice that the name is no longer a list, so it doesn’t have the brackets anymore.

Tuples are also used to return values from functions. In fact, when a function returns more than one value, it’s actually returning a tuple. For example, here’s a function that takes as an argument a float value that represents a price. The function returns the number of dollars and cents. Notice that when we use the function to convert $6 and 55 cents to dollars and cents, the return value is represented as a tuple that contains two numbers. Interestingly, even though tuples are immutable, they can be split into separate variables. When we run the “to dollars cents” function, we can directly assign the output into distinct variables. The information stored as a tuple in the result variable has now been reassigned to two separate variables that we can manipulate as we please. This process is known as unpacking a tuple. Notice that the unpacked variables themselves are no longer tuples. In this case, they’re integers.

A big advantage of working with tuples is that they let you store data of different types inside other data structures. Here’s an example of how this might be useful. This is a list of the starting five players on a university women’s basketball team. Each player is represented by a tuple that contains their name, age, and position. This is a useful way of working with this type of information. The order of the players doesn’t matter that much, and we might want to add to or rearrange them. So we use a list, which is mutable. However, the players themselves are individual records that are represented by tuples. They are a bit more secure because tuples are immutable and more difficult to accidentally change.

Because lists and tuples are iterable, we can extract information from them using loops. For example, we can write a “for” loop that unpacks each tuple into three separate variables, and then print one of the variables for each iteration. This is equivalent to looping over each player record and printing the record at index zero.

Using tuples in data professional work helps make your processes more efficient. It saves memory and can really optimize your programs too. Plus, when others collaborate with you on your code, your use of tuples will make it clear to them that your sequences of values are not intended to be modified. This is yet another great way to save your team time and effort.
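The price-splitting function described in this video can be sketched like this (the function name and the rounding detail are assumptions based on the narration):

```python
def to_dollars_cents(price):
    """Split a float price into whole dollars and remaining cents."""
    dollars = int(price)
    cents = round((price - dollars) * 100)
    return dollars, cents  # returning two values actually returns a tuple

result = to_dollars_cents(6.55)
print(result)  # (6, 55)

# Unpacking the tuple into separate variables:
dollars, cents = to_dollars_cents(6.55)
print(dollars, cents)  # 6 55
```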
A tuple is an immutable sequence that can contain elements of any data type.
True
A tuple is an immutable sequence that can contain elements of any data type.
Reading: Compare lists, strings, and tuples
Reading
You’ve now learned about some of Python’s core iterable sequence data structures, including strings, lists, and tuples. These structures share many similarities, but there are some key differences between them. Data professionals must often decide which data structures work best to solve a particular problem, so understanding the relationship between these classes can help you make informed decisions in your work. This reading is a guide to the similarities and differences between strings, lists, and tuples.
Strings
Syntax/instantiation
Note: The following code block is not interactive.
- Single, double, or triple quotes:
empty_str = ''
my_string1 = 'minerals'
my_string2 = "martin"
my_string3 = """
marathon
golfcart
"""
Note: Using triple quotes to write a string over multiple lines will insert newlines (\n).
my_string3 = """
marathon
golfcart
"""
my_string3
- The str() function can be used for instantiation and conversion.
Note: The following code block is not interactive.
empty_str = str()
my_string = str(125)
Content
- Strings can contain any character—letters, numbers, punctuation marks, spaces—but everything between the opening and closing quotation marks is part of the same single string.
Mutability
- Strings are immutable. This means that once a string is created, it cannot be modified. Any operation that appears to modify a string actually creates a new string object.
Usage
- Strings are most commonly used to represent text data.
Methods
The Python string class comes packed with many useful methods to manipulate the data contained in strings. For more information on these methods, refer to Common String Operations in the Python documentation.
Lists
Syntax/instantiation
- Brackets, with each element separated by a comma:
Note: The following code block is not interactive.
empty_list = []
my_list = [1, 2, 3, 4, 5]
- The list() function can be used for instantiation and conversion. Note that this function only works on iterable data types.
print(list('rocks'))
print(list(('stones', 'water', 'underground')))
Content
- Lists can contain any data type, and in any combination. So, a single list can contain strings, integers, floats, tuples, dictionaries, and other lists.
Note: The following code block is not interactive.
my_list = [1, 2, 1, 2, 'And through', ['and', 'through']]
Mutability
- Lists are mutable. This means that they can be modified after they are created.
num_list = [1, 2, 3]
num_list[0] = 5446
print(num_list)
Usage
- Lists are very versatile and therefore are used in numerous cases. Some common ones are:
- Storing collections of related items
- Storing collections of items that you want to iterate over: Because lists are ordered, you can easily iterate over their elements using a for loop or list comprehension.
- Sorting and searching: Lists can be sorted and searched, making them useful for tasks such as finding the minimum or maximum value in a list or sorting a list of items alphabetically.
- Modifying existing data: Because lists are mutable, they are useful for situations in which you know you’ll need to modify your data.
- Storing results: Lists can be used to store the results of a computation or a series of operations, making them useful in many different programming tasks.
Methods
- You can find methods for the Python list class in More on Lists in the Python documentation.
Tuples
Syntax/instantiation
- Parentheses, with each element separated by a comma:
Note: The following code block is not interactive.
empty_tuple = ()
my_tuple = (1, 'z')
Note: When using parentheses to declare a tuple with just a single element, you must use a trailing comma.
test1 = (1)
test2 = (2,)
print(type(test1))
print(type(test2))
- No parentheses, but each element followed by a comma (even if there’s only one element):
tuple1 = 1,
tuple2 = 2, 3
print(type(tuple1))
print(type(tuple2))
- The tuple() function can be used for instantiation, and for conversion of iterable data types.
Note: The following code block is not interactive.
empty_tuple = tuple()
my_tuple = tuple([1, 'z'])
Content
- Tuples can contain any data type, and in any combination. So, a single tuple can contain strings, integers, floats, lists, dictionaries, and other tuples.
Note: The following code block is not interactive.
my_tuple = (1871, 'all', 'mimsy', ('were', 'the'), ['borogroves'])
Mutability
- Tuples are immutable. This means that once a tuple is created, it cannot be modified.
Usage
- Common uses of tuples include:
- Returning multiple values from a function
- Packing and unpacking sequences: You can use tuples to assign multiple values in a single line of code.
- Dictionary keys: Because tuples are immutable, they can be used as dictionary keys, whereas lists cannot. (You’ll learn more about dictionaries later.)
- Data integrity: Due to their immutable nature, tuples are a more secure way of storing data because they safeguard against accidental changes.
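A brief sketch of the first three use cases, with hypothetical data and function names:

```python
# Returning multiple values from a function (the values come back as one tuple)
def min_max(values):
    return min(values), max(values)

low, high = min_max([3, 7, 1, 9])   # unpacking the returned tuple
print(low, high)                    # 1 9

# Packing and unpacking a sequence in a single line
x, y = 10, 20

# Tuples as dictionary keys (a list here would raise a TypeError)
capitals = {('France', 'Europe'): 'Paris'}
print(capitals[('France', 'Europe')])   # Paris
```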
Methods
- Because tuples are immutable and built for data integrity, Python provides only two methods that can be used on them:
- count() returns the number of times a specified value occurs in the tuple.
- index() searches the tuple for a specified value and returns the index of the first occurrence of the value.
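Both methods in action, with illustrative values:

```python
letters = ('a', 'b', 'a', 'c', 'a')

print(letters.count('a'))   # 3 -- 'a' occurs three times
print(letters.index('c'))   # 3 -- the first occurrence of 'c' is at index 3
```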
Key takeaways
Strings, lists, and tuples are all iterable sequential data structures that share many similarities. They also have fundamental differences that you should be aware of so you can make effective choices in your work as a data professional. When selecting a data structure, consider its manner of instantiation, content, mutability, and the use case.
Resources:
- For more information about strings, refer to the Introduction to Python strings documentation.
- For more information about lists, refer to the Introduction to Python lists documentation.
- For more information about tuples, refer to the Python Standard Data Types tuples documentation.
Video: More with loops, lists, and tuples
This video discusses more complex examples of how to work with loops, lists, and tuples in Python. It also introduces a few new tools that are useful for data professionals.
One example is a function that extracts the name and position of each player in a list of tuples. The function uses a for loop to unpack the tuples and then formats the data into a string.
Another example is a nested loop that creates all the different domino tiles in a set of dominoes. The inner loop generates the pips on the right side of the domino, and the outer loop generates the pips on the left side of the domino.
Finally, the video discusses list comprehensions, which are a concise way to create new lists from existing lists. A list comprehension is basically a for loop written in reverse, with the computation at the beginning and the “for” statement at the end.
The video encourages viewers to explore the notebook on their own and play with the code to learn more.
Hello again! In this video, we’ll consider more complex examples of different ways to work with loops, lists, and tuples. I’ll also introduce a few new tools that are useful for data professionals.

Let’s return to the women’s basketball team list of players from the previous video. In this example, we’ll integrate string formatting, loops, tuples, and lists. Remember, we have a list of tuples, where each tuple contains the name, age, and position of a player. Let’s define a function that extracts the name and position of each player into a list that we’ll use to print the information. We’ll call the function “player position,” and its argument will be a list of tuples that contain player information. Next, we’ll instantiate an empty list that will hold our result, which we’ll build as we loop through the data.

Now we’ll use a for loop to unpack the tuples in our list of players. The variables we assign in the for loop must align with the format of the tuples. There are three components of each tuple in our list: name, age, and position, so we need three variables in our for loop. If we tried to unpack the tuples using only two variables, like “for name, age in players,” the computer would throw an error. It wouldn’t know what to do, because there are three elements in the tuple but we’re only giving the computer two containers to put them in. So we’ll begin our for loop “for name, age, position in players.” Then we’ll use string formatting to append each name and position to the result list. Each string will include some positional formatting and a new line too. Finally, we’ll call this function in a for loop. This for loop will iterate over the results list that is output by the function and print each one. Now we have a nicely formatted, easily readable table of players and positions.

Here’s another application of loops and lists. This is an example of nested loops. A nested loop is when you have one loop inside of another loop. These loops create all the different domino tiles in a set of dominoes, which is a tile-based game played with numbered gaming pieces. Feel free to pause the video and try to figure out what’s happening… We start with the numbers on the left side of the dominoes. These numbers represent the dots, or pips, on the domino. They range from zero to six. For each number in this range, we’ll run another loop to generate the pips on the right side of the domino. Then we insert the left number and right number into a formatted print statement. And here are the dominoes!

Notice that in the first print statement we included a parameter called “end” whose value was a whitespace. When a print statement executes, by default it will end with a new line. So without this parameter, all the dominoes would have been printed in a vertical line, each one beneath the next. But when we set the end character to a whitespace, it prints a space between each domino instead.

Here’s the same code, but instead of printing the dominoes as strings, it stores each one as a tuple of integers in a list called “dominoes.” Now suppose we want to check the second number of the tuple at index four. We can do that by using indexing. Start with the list we want to access, “dominoes,” and put the index of the tuple we want to access in brackets. Then add another pair of brackets containing the index of the value within that tuple.

What if we want to calculate the total number of pips on each domino? We can do that with a for loop that iterates over each tuple, sums the value at index zero and the value at index one, and appends the sum to a list. But there’s a much easier way of doing this. It’s called a list comprehension. A list comprehension formulaically creates a new list based on the values in an existing list. Here’s how it works. Begin by assigning a variable for our new list. We’ll call this one “pips from list comp.” Now we basically write a for loop, only in reverse. We begin with the calculation that creates each element of the list. In this case, we want each element to be the total number of pips on the domino, which is the domino at index zero plus the domino at index one. Then we add a “for” statement. We can check to make sure it gave the same result as our for loop. They’re the same! Note what happened. This is why I said a list comprehension is like a for loop written in reverse. The “for” part of it is at the end of the statement and the computation is at the beginning. Both the for loop and the list comprehension do the same thing, but the list comprehension is much more elegant, and usually faster to execute too.

Hopefully by now you can appreciate how powerful the building blocks of Python can be. I encourage you to explore this notebook on your own and play with the code to discover what happens when you add something here or change something there. Playing with code is one of the best ways to learn! I’ll see you soon.
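The domino example from the video can be sketched like this. The variable names are illustrative and the notebook shown in the video may differ slightly:

```python
# Nested loops build each domino as a (left, right) tuple of pip counts.
dominoes = []
for left in range(0, 7):
    # Start the inner loop at `left` so (0, 1) and (1, 0) aren't both created
    for right in range(left, 7):
        dominoes.append((left, right))

# Indexing: the second number of the tuple at index four
print(dominoes[4][1])   # 4 -- dominoes[4] is (0, 4)

# Total pips per domino, first with a for loop...
pips_from_loop = []
for domino in dominoes:
    pips_from_loop.append(domino[0] + domino[1])

# ...then with an equivalent list comprehension
pips_from_list_comp = [domino[0] + domino[1] for domino in dominoes]

print(pips_from_loop == pips_from_list_comp)   # True
```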
Reading: zip(), enumerate(), and list comprehension
You’ve learned much about iterable objects such as strings, lists, and tuples, and soon you’ll learn more. These objects comprise many of Python’s core data structures and, as a data professional, you’ll work with them constantly. While working in Python, you’ll often need to perform the same tasks and operations many times. This reading will introduce you to three time-saving tools: zip(), enumerate(), and list comprehension.
zip()
The zip() function is a built-in Python function that does what the name implies: It performs an element-wise combination of sequences.
The function returns an iterator that produces tuples containing elements from each of the input sequences. An iterator is an object that enables processing of a collection of items one at a time without needing to assemble the entire collection at once. Use an iterator with loops or other iterable functions such as list() or tuple(). Here’s an example:
cities = ['Paris', 'Lagos', 'Mumbai']
countries = ['France', 'Nigeria', 'India']
places = zip(cities, countries)
print(places)
print(list(places))
<zip object at 0x...>
[('Paris', 'France'), ('Lagos', 'Nigeria'), ('Mumbai', 'India')]
Notice that, in this case, the list() function is used to generate a list of tuples from the iterator object. Here are a few things to keep in mind when using the zip() function.
- It works with two or more iterable objects. The given example zips two sequences, but the zip() function will accept more sequences and apply the same logic.
- If the input objects are of unequal length, the resulting iterator will be the same length as the shortest input.
- If you give it only one iterable object as an argument, the function will return an iterator that produces tuples containing only one element from that iterable at a time.
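The second and third points look like this in practice:

```python
# zip() stops at the length of the shortest input
short = [1, 2]
long = ['a', 'b', 'c', 'd']
print(list(zip(short, long)))   # [(1, 'a'), (2, 'b')]

# With a single iterable, each tuple contains just one element
print(list(zip(short)))         # [(1,), (2,)]
```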
Unzipping
You can also unzip an object with the * operator. Here’s the syntax:
scientists = [('Nikola', 'Tesla'), ('Charles', 'Darwin'), ('Marie', 'Curie')]
given_names, surnames = zip(*scientists)
print(given_names)
print(surnames)
('Nikola', 'Charles', 'Marie')
('Tesla', 'Darwin', 'Curie')
Note that this operation unpacks the tuples in the original list element-wise into two tuples, thus separating the data into different variables that can be manipulated further.
enumerate()
The enumerate() function is another built-in Python function that allows you to iterate over a sequence while keeping track of each element’s index. Similar to zip(), it returns an iterator that produces pairs of indices and elements. Here’s an example:
letters = ['a', 'b', 'c']
for index, letter in enumerate(letters):
    print(index, letter)
0 a
1 b
2 c
Note that the default starting index is zero, but you can assign it to whatever you want when you call the enumerate() function. For example:
letters = ['a', 'b', 'c']
for index, letter in enumerate(letters, 2):
    print(index, letter)
2 a
3 b
4 c
In this case, the number two was passed as an argument to the function, and the first element of the resulting iterator had an index of two. The enumerate() function is useful when an element’s place in a sequence must be used to determine how the element should be handled in an operation.
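For example, here the index determines how each element is handled: every element at an even index is capitalized. The data is illustrative:

```python
letters = ['a', 'b', 'c', 'd']
result = []
for index, letter in enumerate(letters):
    if index % 2 == 0:          # even positions get uppercased
        result.append(letter.upper())
    else:                       # odd positions are left unchanged
        result.append(letter)
print(result)   # ['A', 'b', 'C', 'd']
```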
List comprehension
One of the most useful tools in Python is list comprehension. List comprehension is a concise and efficient way to create a new list based on the values in an existing iterable object. List comprehensions take the following form:
my_list = [expression for element in iterable if condition]
In this syntax:
- expression refers to an operation or what you want to do with each element in the iterable sequence.
- element is the variable name that you assign to represent each item in the iterable sequence.
- iterable is the iterable sequence.
- condition is any expression that evaluates to True or False. This element is optional and is used to filter elements of the iterable sequence.
Here are some examples of list comprehensions:
This list comprehension adds 10 to each number in the list:
numbers = [1, 2, 3, 4, 5]
new_list = [x + 10 for x in numbers]
print(new_list)
[11, 12, 13, 14, 15]
In the preceding example, x + 10 is the expression, x is the element, and numbers is the iterable sequence. There is no condition.
This next list comprehension extracts the first and last letter of each word as a tuple, but only if the word is more than five letters long.
words = ['Emotan', 'Amina', 'Ibeno', 'Sankwala']
new_list = [(word[0], word[-1]) for word in words if len(word) > 5]
print(new_list)
[('E', 'n'), ('S', 'a')]
Note that multiple operations can be performed in the expression component of the list comprehension to result in a list of tuples. This example also makes use of a condition to filter out words that are not more than five letters long.
Key takeaways
zip(), enumerate(), and list comprehension make code more efficient by reducing the need to rely on loops to process data and simplifying working with iterables. Understanding these common tools will save you time and make your process much more dynamic when manipulating data.
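As a closing sketch, all three tools can work together in one expression (the data is illustrative):

```python
cities = ['Paris', 'Lagos', 'Mumbai']
countries = ['France', 'Nigeria', 'India']

# zip() pairs the sequences, enumerate() numbers the pairs (starting at 1),
# and the list comprehension formats each entry as a string.
labels = [f"{n}. {city}, {country}"
          for n, (city, country) in enumerate(zip(cities, countries), 1)]
print(labels)
# ['1. Paris, France', '2. Lagos, Nigeria', '3. Mumbai, India']
```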
Lab: Exemplar: Lists & tuples
Practice Quiz: Test your knowledge: Lists and tuples
Lists and their contents are immutable, so their elements cannot be modified, added, or removed.
False
Lists and their contents are mutable, so their elements can be modified, added, or removed. A list is a data structure that helps store and manipulate an ordered collection of items.
What Python method adds an element to the end of a list?
append()
Python’s append() method adds an element to the end of a list.
A data professional wants to instantiate a tuple. What Python elements can they use to do so? Select all that apply.
Parentheses, The tuple() function
A data professional can use parentheses or the tuple() function to instantiate a tuple. A tuple is an immutable sequence that can contain elements of any data type.
What Python technique formulaically creates a new list based on the values in an existing list?
List comprehension
A list comprehension formulaically creates a new list based on the values in an existing list. A list comprehension functions like a for loop, but is a more efficient and elegant way to create a new list from an existing list.
Dictionaries and sets
Video: Introduction to dictionaries
Dictionaries are a fundamental data structure in Python that store data in key-value pairs. They are versatile and widely used by data professionals to analyze large datasets and store user information. Dictionaries can be created using braces or the dict() function. Keys must be immutable, such as integers, floats, tuples, or strings. Dictionaries are unordered, meaning you cannot access their elements by positional index. Use the in keyword to check whether a key exists. Dictionaries are powerful and versatile, and we will explore more examples and tools in the next lesson.
Introduction to Dictionaries in Python
Dictionaries, also known as associative arrays, are a versatile data structure in Python that stores data in key-value pairs. Each key is unique and serves as an identifier for its corresponding value. Dictionaries are mutable, meaning their contents can be changed after they are created.
Creating a Dictionary
Dictionaries can be created using curly braces ({}) or the dict() function.
Using Curly Braces
Python
# Creating a dictionary using curly braces
animal_sounds = {'dog': 'bark', 'cat': 'meow', 'bird': 'chirp'}
print(animal_sounds)
Output:
{'dog': 'bark', 'cat': 'meow', 'bird': 'chirp'}
Using the dict() Function
Python
# Creating a dictionary using the dict() function
employee_data = dict()
employee_data['name'] = 'Alice'
employee_data['age'] = 30
employee_data['role'] = 'Software Engineer'
print(employee_data)
Output:
{'name': 'Alice', 'age': 30, 'role': 'Software Engineer'}
Accessing Values
To access a value in a dictionary, use the key enclosed in square brackets ([]).
Python
# Accessing values using keys
animal_sound = animal_sounds['dog']
print(animal_sound)
Output:
bark
Adding Key-Value Pairs
To add a new key-value pair to an existing dictionary, use the assignment operator (=).
Python
# Adding a new key-value pair
animal_sounds['cow'] = 'moo'
print(animal_sounds)
Output:
{'dog': 'bark', 'cat': 'meow', 'bird': 'chirp', 'cow': 'moo'}
Modifying Values
To modify the value associated with an existing key, use the assignment operator (=).
Python
# Modifying an existing value
employee_data['age'] = 32
print(employee_data)
Output:
{'name': 'Alice', 'age': 32, 'role': 'Software Engineer'}
Checking for Keys
To check if a key exists in a dictionary, use the in operator.
Python
# Checking for a key
if 'name' in employee_data:
    print('Key "name" exists in the dictionary')
Output:
Key "name" exists in the dictionary
Removing Key-Value Pairs
To remove a key-value pair from a dictionary, use the del keyword.
Python
# Removing a key-value pair
del animal_sounds['cow']
print(animal_sounds)
Output:
{'dog': 'bark', 'cat': 'meow', 'bird': 'chirp'}
Iterating over Dictionaries
To iterate over the keys in a dictionary, use a for loop.
Python
# Iterating over keys
for key in animal_sounds:
    print(key)
Output:
dog
cat
bird
To iterate over both keys and values, use a for loop with the .items() method.
Python
# Iterating over key-value pairs
for key, value in animal_sounds.items():
    print(f"Key: {key}, Value: {value}")
Output:
Key: dog, Value: bark
Key: cat, Value: meow
Key: bird, Value: chirp
Dictionaries are a powerful and versatile data structure that is essential for data manipulation in Python. They are widely used in various applications, including web development, data analysis, and machine learning.
Dictionaries are one of the most widely used and important data structures in Python. A dictionary is a data structure that consists of a collection of key-value pairs. They are instantiated with braces or the dict() function. We’ll discuss that more shortly. Both veteran and entry-level data professionals use dictionaries to analyze large data sets with fast processing power. This helps them gather and transform user information. Dictionaries also provide a straightforward way to store data, making it easier for users to find specific information.

To use a regular dictionary, not the data structure, but the actual book with words and definitions, you look at the word, find it, then read its definition. It’s essentially the same with the Python dictionary. You look at the key, which will let you access the values associated with that key. That’s what’s meant by “key-value pairs.”

Here’s a simple example to illustrate the concept. Suppose we have a zoo, and the zoo has different pens that contain different animals. We could have a dictionary that stores this information for us, with the pen numbers as keys and the animals as values. We could use this dictionary to look up which animals are in each pen. For example, if we want to know what animals are in pen two, we type the name of the dictionary, zoo, followed by the pen in brackets. Accessing a dictionary this way always searches over the keys and returns the values of the corresponding key. It doesn’t work the other way around. I can’t use indexing to search for zebras and find out their pen. I will get a key error, because “zebras” is not a key in the dictionary.

Dictionaries are instantiated mainly in two ways. The first way is with braces. With this approach, each key is separated from its value by a colon, and each key-value pair is separated from the next by a comma. The second way to create a dictionary is with the dict() function. When using the dict() function, the syntax is a little different. When the keys are strings, you can type them as keyword arguments. The last time we made this dictionary, we used quotation marks to indicate that the keys were strings. Here we don’t, because they’re keyword arguments. Also, instead of using a colon between the keys and their values, you use an equal sign. The dictionary lookup is the same, irrespective of whether braces or the dict() function is used. The dict() function is also a little more flexible in how it can be used. For example, we can create this same dictionary once again by passing a list of lists as an argument, or even a list of tuples, a tuple of tuples, or a tuple of lists. They all give us the same result. If we want to add a new key-value pair to an existing dictionary, say to put crocodiles in pen four, we can assign it like this.

A dictionary’s keys must be immutable. Immutable keys include, but are not limited to, integers, floats, tuples, and strings. Lists, sets, and other dictionaries are not included in this category, since they are mutable. Another important thing to note about dictionaries is that they’re unordered. That means you can’t access them by referencing a positional index. If I try to access our zoo dictionary at index two, I get a key error, because the computer is interpreting the two as a dictionary key, not as an index. Also, because dictionaries are unordered, you’ll sometimes find that the order of the entries can change as you’re working with them. If the order of your data is important, it’s better to use an ordered data structure like a list.

Finally, you can check if a key exists in your dictionary simply by using the in keyword. Note that this only works for keys. You can’t check for values this way.

There’s a lot that we can do with dictionaries, and what we’ve reviewed here is only the beginning. Next, we’ll consider more examples that show the power of dictionaries. You’ll also learn about some tools that make working with dictionaries easy and convenient. Meet you there.
Fill in the blank: A dictionary is a data structure that consists of a collection of _____ pairs.
key-value
A dictionary is a data structure that consists of a collection of key-value pairs. In a Python dictionary, looking up a key lets you access the data values associated with that key.
Video: Dictionary methods
Dictionaries are a cornerstone of Python’s data structures, offering a flexible and efficient approach to data organization and retrieval. Unlike lists, which rely on positional indices, dictionaries capitalize on key-value pairs, enabling rapid data lookup and manipulation.
Transforming a list of tuples into a dictionary involves iterating through the tuples using a for loop. Within each iteration, extract the key and value from the tuple and assign the key to the dictionary. For each key, maintain a list to store the corresponding values.
Python provides several powerful methods for interacting with dictionaries. The keys() method retrieves a list of all keys in the dictionary. Similarly, the values() method returns a list of all values. To access both keys and values simultaneously, utilize the items() method, which returns a list of tuples, each containing a key and its respective value.
Dictionaries stand as an indispensable tool for data analytics endeavors. Their ability to store, organize, and retrieve data efficiently makes them an integral component of the Python data science toolkit. As you delve deeper into the world of dictionaries, you’ll uncover their vast potential for data manipulation and analysis.
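The list-of-tuples-to-dictionary conversion described above can be sketched like this. The player data here is hypothetical, standing in for the roster used in the video:

```python
# Hypothetical roster: (name, age, position) tuples
team = [('Mika', 27, 'guard'), ('Dana', 31, 'center'),
        ('Lin', 24, 'guard'), ('Ruth', 29, 'center')]

new_team = {}
for name, age, position in team:           # unpack each tuple
    if position in new_team:               # key exists: append to its list
        new_team[position].append((name, age))
    else:                                  # new key: start a list of tuples
        new_team[position] = [(name, age)]

print(new_team.keys())     # dict_keys(['guard', 'center'])
print(new_team['guard'])   # [('Mika', 27), ('Lin', 24)]
```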
Exploring Dictionary Methods in Python
Dictionaries are fundamental data structures in Python that excel at storing and organizing data using key-value pairs. Unlike lists, which rely on positional indices, dictionaries offer a more flexible and intuitive approach to data management. They are widely used in various applications, including data analysis, web development, and machine learning.
Python provides a rich set of built-in methods for manipulating and interacting with dictionaries, making them even more powerful and versatile. These methods facilitate efficient data retrieval, modification, and addition, streamlining the process of data manipulation.
Essential Dictionary Methods for Data Management
- keys() method: The keys() method retrieves a list of all keys present in the dictionary, providing a concise overview of the dictionary’s contents.
Python
dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
keys = dictionary.keys()
print(keys)
Output:
dict_keys(['name', 'age', 'city'])
- values() method: This method returns a list of all values associated with the dictionary’s keys, offering a clear view of the stored data.
Python
dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
values = dictionary.values()
print(values)
Output:
dict_values(['Alice', 30, 'Seattle'])
- items() method: The items() method retrieves a list of tuples, where each tuple contains a key and its corresponding value. This provides a combined view of the dictionary’s structure and data.
Python
dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
items = dictionary.items()
print(items)
Output:
dict_items([('name', 'Alice'), ('age', 30), ('city', 'Seattle')])
- get() method: This method takes a key as an argument and returns the value associated with that key. If the key does not exist, it returns None by default.
Python
dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
value = dictionary.get("name")
print(value)
value = dictionary.get("occupation")
print(value)
Output:
Alice
None
- setdefault() method: The setdefault() method takes a key and an optional default value as arguments. If the key exists, it returns the existing value. If the key does not exist, it adds the key to the dictionary and returns the default value.
Python
dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
value = dictionary.setdefault("name", "default_name")
print(value)
value = dictionary.setdefault("occupation", "default_occupation")
print(value)
print(dictionary)
Output:
Alice
default_occupation
{'name': 'Alice', 'age': 30, 'city': 'Seattle', 'occupation': 'default_occupation'}
- update() method: This method takes another dictionary as an argument and updates the current dictionary with the key-value pairs from the other dictionary.
Python
dictionary1 = {"name": "Bob", "age": 25, "city": "San Francisco"}
dictionary2 = {"occupation": "Software Engineer", "company": "Google"}
dictionary1.update(dictionary2)
print(dictionary1)
Output:
{'name': 'Bob', 'age': 25, 'city': 'San Francisco', 'occupation': 'Software Engineer', 'company': 'Google'}
- copy() method: The copy() method creates a shallow copy of the dictionary. A shallow copy means that the new dictionary will have the same keys as the original dictionary, but the values will be references to the same objects as the original dictionary.
Python
dictionary = {"name": "Alice", "age": 30, "city": "Seattle"}
new_dictionary = dictionary.copy()
new_dictionary["name"] = "Charlie"
print(dictionary)
print(new_dictionary)
Output:
{'name': 'Alice', 'age': 30, 'city': 'Seattle'}
{'name': 'Charlie', 'age': 30, 'city': 'Seattle'}
These essential dictionary methods provide a solid foundation for mastering data manipulation in Python. By understanding and utilizing them, you can store, organize, and retrieve data efficiently in your own projects.
Previously, you were introduced to dictionaries and learned a little bit about how they work. Let’s continue our exploration of dictionaries and how to use them.

Let’s consider a previous example and revisit the women’s basketball team roster. Recall that the roster was encoded as a list of tuples. Each tuple represented the name, age, and position of a player on the team. The list of tuples was useful when we had a single team and one player per position. What if we add more players beyond the starting five? A dictionary can help us organize the data according to our specific needs. For example, what if we want to be able to look up players by their position? We can create a dictionary where the positions are the keys and the players are the values, each represented as a tuple that contains two values: their name and age.

We could retype the information into a dictionary or cut and paste it, but if you find yourself doing these things, you can take this as an opportunity to improve your coding skills. Consider that this is the information for just 10 players; what if we had data for the whole league? We can convert this information to a dictionary with a for loop and some conditional logic. We’ll begin by instantiating an empty dictionary and assigning it to a variable named “new_team.” The idea is to loop over each tuple in the list, extract the position and assign it as a dictionary key, and extract the player’s name and age and assign them as a tuple within a list, representing the value for that key. The process would repeat for each iteration of the loop, until all of the players are recorded in the dictionary. Notice that each position is only represented once as a key. So the final dictionary has five keys, and there are two players in the list at each key’s value.

Now let’s write the loop. First, we’ll assign the empty dictionary to a variable called “new_team.” Then we’ll write a for loop that unpacks the information in the original tuples: “for name, age, position in team.” And here’s where the conditional logic comes in. If the position already exists as a key in our dictionary, then we want to append the name and age tuple to the list of tuples. Remember, the value for each key will be a list that contains tuples of player information. If the position does not already exist as a key in the dictionary, we’ll have to assign it. We’ll use an else statement to do this. With only a few lines of code, we have converted our list of tuples to a dictionary. Let’s check that it works. It sure does. Creating dictionaries this way is a common practice in data analytics with Python, so learning this process will help make you a more capable data professional.

Now, let’s learn about some useful methods that we can use on dictionaries to really take advantage of their power. Specifically, we’ll discuss the keys, values, and items methods. If you run a loop over a dictionary, the loop will only access the keys, not the values. For example, if we loop over the dictionary we just created and print each iteration, the computer will return five positions, the keys of the dictionary. But you don’t need to write a loop every time you want to access the keys of your dictionary. That’s what the keys method is for. The keys method lets you retrieve only the dictionary’s keys. Returning to our “new_team” dictionary, when we apply the keys method to the dictionary, the computer returns a list of all its keys. Similarly, the values method lets you retrieve only the dictionary’s values. When applied to our “new_team” dictionary, we get the values returned as a list. Since our values are lists of tuples, it means the result of calling this method is a list of lists of tuples. But what if you want to access both the keys and their values? You can, using the items method, which lets you retrieve both the dictionary’s keys and values. Let’s use a loop to print what the items method returns so the output is prettier.

Dictionaries make storing and retrieving data fast and efficient. Keep exploring the many things you can do with them. With time, you’ll find that they become an important tool in your data analytics toolbox.
Reading: Reference guide: Dictionaries
Reading
By now you’ve encountered dictionaries and are discovering their power and utility as a data structure in Python. You’ve also learned that dictionaries provide a way to store and retrieve data using key-value pairs. Data professionals use dictionaries for many tasks, so it’s important to be familiar with how they work. This reading is a reference guide about dictionaries. It’s designed to help you in your Python learning journey.
Create a dictionary
There are two main ways to create dictionaries in Python:
- Braces: {}
- The dict function: dict()
When instantiating a dictionary using braces, separate each element with a colon. For example, the following code creates a dictionary containing continents as keys and their smallest countries as values:
Note: The following code block is not interactive.
smallest_countries = {'Africa': 'Seychelles',
'Asia': 'Maldives',
'Europe': 'Vatican City',
'Oceania': 'Nauru',
'North America': 'St. Kitts and Nevis',
'South America': 'Suriname'
}
To create an empty dictionary, use empty braces or the dict() function:
Note: The following code block is not interactive.
empty_dict_1 = {}
empty_dict_2 = dict()
The dict() function uses a different syntax, where keys are entered as the function’s keyword arguments and values are assigned with an equals operator:
Note: The following code block is not interactive.
smallest_countries = dict(africa='Seychelles',
asia='Maldives',
europe='Vatican City',
oceania='Nauru',
north_america='St. Kitts and Nevis',
south_america ='Suriname'
)
Notice that, because the keywords are identifiers rather than strings, they cannot contain whitespace.
Some important notes about keys and values:
- Dictionary keys: Can be of any immutable data type, such as strings, numbers, or tuples
- Dictionary values: Can be of any data type—mutable or immutable—including other dictionaries or objects
- Each key can only correspond to a single value; so, for example, this will throw an error:
invalid_dict = {'numbers': 1, 2, 3}
Error on line 1:
invalid_dict = {'numbers': 1, 2, 3}
^
SyntaxError: invalid syntax
But if you enclose multiple values within another single data structure, you can create a valid dictionary. For example:
valid_dict = {'numbers': [1, 2, 3]}
print(valid_dict)
{'numbers': [1, 2, 3]}
Work with dictionaries
Access values
To access a specific value in a dictionary, you must refer to its key using brackets:
my_dict = {'nums': [1, 2, 3],
'abc': ['a', 'b', 'c']
}
print(my_dict['nums'])
[1, 2, 3]
To access all values in a dictionary, use the values() method:
my_dict = {'nums': [1, 2, 3],
'abc': ['a', 'b', 'c']
}
print(my_dict.values())
dict_values([[1, 2, 3], ['a', 'b', 'c']])
Assign new keys
Dictionaries are mutable data structures in Python. You can add to and modify existing dictionaries. To add a new key to a dictionary, use brackets:
my_dict = {'nums': [1, 2, 3],
'abc': ['a', 'b', 'c']
}
# Add a new 'floats' key
my_dict['floats'] = [1.0, 2.0, 3.0]
print(my_dict)
{'nums': [1, 2, 3], 'abc': ['a', 'b', 'c'], 'floats': [1.0, 2.0, 3.0]}
Check if a key exists in a dictionary
To check if a key exists in a dictionary, use the in keyword:
smallest_countries = {'Africa': 'Seychelles',
'Asia': 'Maldives',
'Europe': 'Vatican City',
'Oceania': 'Nauru',
'North America': 'St. Kitts and Nevis',
'South America': 'Suriname'
}
print('Africa' in smallest_countries)
print('Asia' not in smallest_countries)
True
False
Delete a key-value pair
To delete a key-value pair from a dictionary, use the del keyword:
my_dict = {'nums': [1, 2, 3],
'abc': ['a', 'b', 'c']
}
del my_dict['abc']
print(my_dict)
{'nums': [1, 2, 3]}
Dictionary methods
Dictionaries are a core Python class. As you’ve learned, classes package data with tools to work with it. Methods are functions that belong to a class. Dictionaries have a number of built-in methods that are very useful. Some of the most commonly used methods include:
items()
Return a view of the (key, value) pairs of the dictionary:
my_dict = {'nums': [1, 2, 3],
'abc': ['a', 'b', 'c']
}
print(my_dict.items())
dict_items([('nums', [1, 2, 3]), ('abc', ['a', 'b', 'c'])])
keys()
Return a view of the dictionary’s keys:
my_dict = {'nums': [1, 2, 3],
'abc': ['a', 'b', 'c']
}
print(my_dict.keys())
dict_keys(['nums', 'abc'])
values()
Return a view of the dictionary’s values:
my_dict = {'nums': [1, 2, 3],
'abc': ['a', 'b', 'c']
}
print(my_dict.values())
dict_values([[1, 2, 3], ['a', 'b', 'c']])
Note that the objects returned by these methods are view objects. They provide a dynamic view of the dictionary’s entries, which means that, when the dictionary changes, the view reflects these changes. Dictionary views can be iterated over to yield their respective data. They also support membership tests.
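The dynamic behavior of view objects can be demonstrated with a short sketch:

```python
my_dict = {'nums': [1, 2, 3]}
keys_view = my_dict.keys()
print(keys_view)  # dict_keys(['nums'])

# Modify the dictionary: the existing view reflects the change
my_dict['abc'] = ['a', 'b', 'c']
print(keys_view)  # dict_keys(['nums', 'abc'])

# Views also support membership tests
print('abc' in keys_view)  # True
```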
Additional resources
- For more information about dictionaries, refer to the Python dictionary documentation.
- For more dictionary methods, refer to the Python mapping types documentation.
- For more information about view objects, refer to the Python dictionary view objects documentation.
Video: Introduction to sets
Sets in Python
Sets are data structures in Python that store unordered, non-interchangeable elements. Each element must be unique and immutable. Sets are mutable, meaning they can be changed after creation.
Creating Sets
Sets can be created using the set() function or non-empty braces. The set() function takes an iterable as an argument and returns a new set object.
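As a quick sketch of both creation methods (the element order in the printed output is not guaranteed, since sets are unordered):

```python
# From an iterable, via the set() function -- duplicates are removed
letters = set(['a', 'b', 'b', 'c'])
print(letters)

# From non-empty braces
digits = {1, 2, 3}
print(digits)
```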
Set Operations
Python provides built-in methods for performing common set operations, such as intersection, union, difference, and symmetric difference.
- Intersection: Finds the elements that two sets have in common.
- Union: Finds all the elements from both sets.
- Difference: Finds the elements present in one set, but not the other.
- Symmetric Difference: Finds elements from both sets that are mutually not present in the other.
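The four operations above can be sketched with two small example sets (the values here are invented for illustration):

```python
x1 = {1, 2, 3}
x2 = {3, 4, 5}

print(x1 & x2)  # intersection: {3}
print(x1 | x2)  # union: {1, 2, 3, 4, 5}
print(x1 - x2)  # difference: {1, 2}
print(x1 ^ x2)  # symmetric difference: {1, 2, 4, 5}
```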
Conclusion
Sets are a versatile data structure that can be used for a variety of tasks, such as storing unique values, removing duplicates from a list, and combining data from multiple sources.
What are Sets?
Sets are a fundamental data structure in Python that store collections of unique, unordered elements. They are similar to lists in that they can store multiple values, but unlike lists, sets cannot contain duplicate elements. Additionally, sets are mutable, meaning that they can be changed after creation.
Creating Sets
There are two primary ways to create sets in Python:
- Using the set() function: The set() function takes an iterable as an argument and returns a new set object containing the unique elements from the iterable. For example:

my_set = set([1, 2, 3, 4, 5])
print(my_set) # Output: {1, 2, 3, 4, 5}

- Using curly braces: Curly braces can be used to create sets by enclosing the elements within the braces. For example:

my_set = {1, 2, 3, 4, 5}
print(my_set) # Output: {1, 2, 3, 4, 5}
Common Set Operations
Python provides built-in methods for performing common set operations, such as:
- Union: The union of two sets is the collection of all unique elements from both sets. It is represented by the | operator. For example:

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 | set2
print(set3) # Output: {1, 2, 3, 4, 5, 6}

- Intersection: The intersection of two sets is the collection of elements that are common to both sets. It is represented by the & operator. For example:

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 & set2
print(set3) # Output: set()

- Difference: The difference of two sets is the collection of elements that are in the first set but not in the second set. It is represented by the - operator. For example:

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 - set2
print(set3) # Output: {1, 2, 3}

- Symmetric Difference: The symmetric difference of two sets is the collection of elements that are in one set but not in the other set. It is represented by the ^ operator. For example:

set1 = {1, 2, 3}
set2 = {4, 5, 6}
set3 = set1 ^ set2
print(set3) # Output: {1, 2, 3, 4, 5, 6}
Additional Set Methods
add(element)
: Adds an element to the set.remove(element)
: Removes an element from the set.pop()
: Removes and returns an arbitrary element from the set.clear()
: Removes all elements from the set.isdisjoint(set2)
: Checks whether the set has no elements in common with the specified set.issubset(set2)
: Checks whether the set is a subset of the specified set.issuperset(set2)
: Checks whether the set is a superset of the specified set.
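A short sketch exercising several of these methods on an invented example set:

```python
s = {1, 2, 3}
s.add(4)                            # s is now {1, 2, 3, 4}
s.remove(2)                         # s is now {1, 3, 4}; raises KeyError if 2 were absent

print(s.issubset({1, 2, 3, 4, 5}))  # True
print(s.issuperset({1, 3}))         # True
print(s.isdisjoint({9, 10}))        # True -- no elements in common

popped = s.pop()                    # removes and returns an arbitrary element
s.clear()
print(s)                            # set()
```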
Conclusion
Sets are a versatile and powerful data structure in Python that can be used for a variety of tasks, including storing unique values, removing duplicates from a list, and performing set operations. Understanding the basics of sets is essential for any Python programmer.
Welcome back! In this video,
we’re going to discover sets. A set is a data structure in Python that contains only unordered,
non-interchangeable elements. Sets are instantiated
with the set() function or non-empty braces. Each set element is unique and immutable. However, the set itself is mutable. Sets are valuable when storing mixed data in a single row, or a
record, in a data table. They’re also frequently used
when storing a lot of elements, and you want to be certain that each one is only present once. Because sets are mutable, they cannot be used as
keys in a dictionary. There are two ways to create a set. The first way is with the set function. The set function takes an
iterable as an argument and returns a new set object. Let’s examine the behavior of sets by passing lists, tuples, and strings through the set function. To turn the list containing “foo, bar,
baz, and foo” into a set, pass a list through the set function. And notice, the list loses the second foo. As I’ve mentioned, each
element must be unique in sets. Pass a tuple through the set function using two sets of parentheses:
one to tell the computer that the data we are
working with is a tuple; the other because the set function only takes a single argument. Again, the same result, only one foo element can
be present in the set. Finally, pass a string
through the set function. It doesn’t return the string, just the singular occurrence
of the letters in the string, O and F, in an unordered way. This is because the set function
accepts a single argument, and that argument must be iterable. A string is iterable, so
the set function splits it into individual characters and
keeps only the unique ones. The second way to instantiate
a set is with braces. However, you have to put
something inside the braces. Otherwise, the computer will
interpret your empty braces as a dictionary. Note here that instantiating
a set with braces treats what’s inside
the braces as literals. So when instantiating a set of only a single string using braces, it returns a set with a single element, and the element is the string itself. Remember, to define an
empty set or a new set, it is best to use the set function. You can only use curly braces
when the set is not empty and you are assigning
the set to a variable. Also, keep in mind that because sets are unordered, a set cannot be indexed or sliced. Now, let's discuss some
additional functions you can use on sets. First, intersection finds the elements that two sets have in common. Union finds all the
elements from both sets. Difference finds the
elements present in one set, but not the other. And symmetric difference
finds elements from both sets that are mutually not
present in the other. Python provides built-in methods for performing each of these functions. Let’s start with the intersection method, denoted by the ampersand. First, define two sets. Then, apply the intersection function, either by attaching the
intersection method to set one, and passing set two to
the method’s argument, or by using the ampersand operator. Great. Now, let’s apply
the union function. Use braces to define the
two sets, X one and X two. The goal is to observe where they overlap. Print them using the union
method on the X two variable or the union operator symbol. Union is a commutative
operation in mathematics, so the combined values will be the same no matter what order you
put your variables in. The difference operation on sets, however, is not a commutative operation. Just like in math, if you
subtract five from seven, you get a different result than if you subtract seven from five. You can use either the difference method or the minus sign as a set operator. Subtracting set two from set one gives us only the elements in set one that are not shared with set two. But we don’t know if set
two contains any elements that are not shared with set one. The inverse is also true:
subtracting set one from set two gives us only the elements in set two that are not shared with set one. But we don’t know if set
one contains any elements that are not shared with set two. To get around this and
observe the difference between two sets mutually, use the symmetric difference function. As you might have guessed, you can use the symmetric
difference method or the symmetric difference
operator, expressed by a caret. Symmetric difference
outputs all the elements that the two sets do not share. Excellent work with sets in Python! You’ve learned so much in this section of the course already, and everything you’re
learning is preparing you for a really rewarding
career working with data. Can’t wait to be with you soon again.
In Python, what type of elements does a set contain? Select all that apply.
Unordered, Non-interchangeable
In Python, a set is a data structure that contains only unordered, non-interchangeable elements.
Reading: Reference guide: Sets
Reading
Data professionals depend on sets for separating data and identifying its unique elements. As you have been discovering, set objects are similar to lists and dictionaries, yet they have neither key-value pairs nor positional indexing capability. Additionally, sets contain unique values but have no item order or index behavior. Data professionals compare sets to understand the range of data they contain, where they intersect, and what items are present in either set but not both. Sets are also helpful when cleaning data for analysis. This reading is a reference guide for sets to help you as you continue learning Python.
Save this course item
You may want to save a copy of this guide for future reference. Use it as a resource for additional practice or in your future professional projects. To access a downloadable version of this course item, click the link below and select “Use Template.”
Sets review
A set is a collection of unique data elements, without duplicates. In Python, it is an object class—in fact, two different classes—which you’ll learn about in this reading. However, sets are not unique to Python or even to computer programming; they are an important concept in general mathematics. Sets provide a simple means to identify unique data elements.
Create a set
Create a set using braces:
my_set = {5, 10, 10, 20}
Note that an empty set cannot be created with braces, as this will be interpreted as an empty dictionary.
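This distinction can be checked directly with type():

```python
print(type({}))        # <class 'dict'> -- empty braces create a dictionary
print(type(set()))     # <class 'set'>  -- use set() for an empty set
print(type({5, 10}))   # <class 'set'>  -- non-empty braces create a set
```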
There are two functions for creating sets in Python: set() and frozenset(). Use these on any iterable object, or call them with no arguments to create empty sets. The set() function creates an object of the set class, which has these characteristics:
- This is a mutable data type.
- Because it’s mutable, this class comes with additional methods to add and remove data from the set.
- It can be applied to any iterable object and will remove duplicate elements from it.
- It is unordered and non-indexable.
- Elements in a set must be hashable; generally, this means they must be immutable. (Refer to the additional resources for more information on hashing.)
In the examples that follow, four sets are instantiated using a variety of data types:
example_a = [1, 2, 2.0, '2']
set(example_a)
{1, 2, '2'}
Notice that, in the preceding example, 2 and 2.0 are evaluated as equivalent, even though one is an integer and the other is a float.
example_b = ('apple', (1, 2, 2, 2, 3), 2)
set(example_b)
{2, 'apple', (1, 2, 2, 2, 3)}
In the preceding example, (1, 2, 2, 2, 3) is a tuple, which is hashable (≈ immutable) and thus treated as a distinct single element in the resulting set.
example_c = [1.5, {'a', 'b', 'c'}, 1.5]
set(example_c)
Error on line 2:
set(example_c)
TypeError: unhashable type: 'set'
The preceding example throws an error because each element of a set must be hashable (≈ immutable), but {‘a’, ‘b’, ‘c’} is a set, which is a mutable (unhashable) object.
The following example demonstrates the add() method, which is one of the special methods available to sets but not to frozensets.
example_d = {'mother', 'hamster', 'father'}
example_d.add('elderberries')
example_d
{'hamster', 'father', 'elderberries', 'mother'}
An element was added to the example_d set, thus modifying it. This is an example of the mutability of the set class.
Frozensets are another type of set in Python. They are their own class, and they are very similar to sets, except they are immutable.
- This is an immutable data type.
- It can be applied to any iterable object and will remove duplicate elements from it.
- Because they’re immutable, frozensets can be used as dictionary keys and as elements in other sets.
In this example, a frozenset is used within a set.
example_e = [1.5, frozenset(['a', 'b', 'c']), 1.5]
set(example_e)
{1.5, frozenset({'a', 'b', 'c'})}
Unlike example_c previously, this set does not throw an error. This is because it contains a frozenset, which is an immutable type and can therefore be used in sets.
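Because frozensets are hashable, they can also serve as dictionary keys, as noted above. A minimal sketch with an invented lookup table:

```python
# Hypothetical lookup table keyed by frozensets of axis names
dimensions = {
    frozenset(['x', 'y']): '2D',
    frozenset(['x', 'y', 'z']): '3D',
}

# Element order doesn't matter when looking up a frozenset key
print(dimensions[frozenset(['y', 'x'])])  # 2D
```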
Set methods
Sets are useful to determine which values are contained in a data structure and to eliminate duplicate values. There are numerous set methods—such as intersection, union, difference, and symmetric difference—that add functionality and power to working with sets.
union()
- Return a new set with elements from the set and all others.
- The operator for this function is the pipe ( | ).
set_1 = {'a', 'b', 'c'}
set_2 = {'b', 'c', 'd'}
print(set_1.union(set_2))
print(set_1 | set_2)
{'b', 'd', 'a', 'c'}
{'b', 'd', 'a', 'c'}
intersection()
- Return a new set with elements common to the set and all others.
- The operator for this function is the ampersand (&).
set_1 = {'a', 'b', 'c'}
set_2 = {'b', 'c', 'd'}
print(set_1.intersection(set_2))
print(set_1 & set_2)
{'b', 'c'}
{'b', 'c'}
difference()
- Return a new set with elements in the set that are not in the others.
- The operator for this function is the minus sign ( - ).
set_1 = {'a', 'b', 'c'}
set_2 = {'b', 'c', 'd'}
print(set_1.difference(set_2))
print(set_1 - set_2)
{'a'}
{'a'}
symmetric_difference()
- Return a new set with elements in either the set or other, but not both.
- The operator for this function is the caret ( ^ ).
set_1 = {'a', 'b', 'c'}
set_2 = {'b', 'c', 'd'}
print(set_1.symmetric_difference(set_2))
print(set_1 ^ set_2)
{'d', 'a'}
{'d', 'a'}
Additional resources
- Refer to the Python documentation for more information about sets and frozensets, including a complete list of available class methods.
- For methods unique to sets (and unavailable to frozensets), refer to this Python set methods documentation.
- For more examples of sets, refer to the Python tutorial on sets.
- For more information on hash tables, what makes something hashable, and hashing as a concept, refer to this resource from Runestone Academy. For an interesting story about the birth of the original hashing algorithms, check out this IEEE Spectrum article.
Lab: Exemplar: Dictionaries & sets
Practice Quiz: Test your knowledge: Dictionaries and sets
Fill in the blank: In Python, a dictionary’s _____ must be immutable.
keys
In Python, a dictionary’s keys must be immutable. Immutable keys include, but are not limited to, integers, floats, tuples, and strings. Lists, sets, and other dictionaries are not included in this category since they are mutable.
In Python, what does the items() method retrieve?
Both a dictionary’s keys and values
In Python, the items() method is used to retrieve both a dictionary’s keys and values.
A data professional is working with two Python sets. What function can they use to find all the elements from both sets?
union()
When working with two Python sets, a data professional can use the union() function to find all the elements from both sets.
Arrays and vectors with Numpy
Video: The power of packages
Python has many advanced features that can be used for data work and other scientific applications. These features are not included in basic Python, so it’s necessary to add them to your scripts.
- Libraries, packages, and modules are reusable collections of code that provide additional functionality.
- Libraries are often used interchangeably with packages.
- Commonly used libraries for data work are matplotlib, seaborn, NumPy, and pandas.
- Modules are accessed from within a package or a library.
- Modules are used to organize functions, classes, and other data in a structured way.
- Commonly used modules for data professional work are math and random.
- There are several ways to import modules.
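The common import forms can be sketched with the standard-library modules mentioned above:

```python
# Import a whole module, then access its contents with dot notation
import math
print(math.sqrt(16))    # 4.0

# Import specific names directly into your namespace
from math import pi
print(pi)

# Import a module under an alias
import random as rnd
rnd.seed(42)            # seed for reproducibility
print(rnd.randint(1, 10))
```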
Video: Introduction to NumPy
NumPy is a powerful and dynamic Python library that contains multidimensional array and matrix data structures, as well as functions to manipulate them. Its power comes from vectorization, which enables operations to be performed on multiple components of a data object at the same time. This makes it particularly useful for data professionals who work with large quantities of data. NumPy is also efficient because it computes more efficiently than traditional for loops. Vectors also take up less memory space, which is another important factor when working with a lot of data. In addition to being highly useful in its own right, NumPy powers a lot of other Python libraries, like pandas, so understanding how NumPy works will help you use these other packages.
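The loop-versus-vectorization comparison described above can be sketched as follows; list_a and list_b are invented example data:

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [2, 4, 6]

# For-loop approach: build the element-wise product one item at a time
list_c = []
for i in range(len(list_a)):
    list_c.append(list_a[i] * list_b[i])
print(list_c)  # [2, 8, 18]

# Vectorized approach: convert to arrays and multiply in one statement
array_a = np.array(list_a)
array_b = np.array(list_b)
print(array_a * array_b)  # [ 2  8 18]
```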
Introduction to NumPy
NumPy is a Python library that provides a powerful and efficient way to work with arrays and matrices. It is an essential tool for data scientists, machine learning engineers, and anyone who works with numerical data.
What is NumPy?
NumPy stands for “Numerical Python”. It is a library that provides a variety of functions and data structures for working with numerical data. NumPy arrays are stored in memory in a way that makes them very efficient for operations such as arithmetic, sorting, and filtering.
Why use NumPy?
There are several reasons why NumPy is a popular choice for working with numerical data:
- Speed: NumPy arrays are stored in memory in a way that makes them very efficient for operations such as arithmetic, sorting, and filtering. This can make your code run much faster, especially when you are working with large datasets.
- Flexibility: NumPy supports a variety of data types, including integers, floats, strings, and complex numbers. It also supports a variety of operations, including arithmetic, sorting, filtering, and statistical functions.
- Integration with other Python libraries: NumPy is well-integrated with other Python libraries, such as pandas and Matplotlib. This makes it easy to use NumPy with other tools for data analysis and visualization.
Getting started with NumPy
To get started with NumPy, you first need to install it. You can do this using the following command:
pip install numpy
Once NumPy is installed, you can import it into your Python scripts using the following import statement:

import numpy as np
Creating NumPy arrays
There are several ways to create NumPy arrays. One way is to use the np.array() function. For example, the following code creates a NumPy array of integers:

array = np.array([1, 2, 3, 4, 5])

You can also create NumPy arrays from lists, tuples, and other data structures. For example, the following code creates a NumPy array from a list of strings:

array = np.array(['a', 'b', 'c', 'd', 'e'])
Working with NumPy arrays
Once you have created a NumPy array, you can work with it using a variety of methods and functions. For example, the following code prints the shape of the array:

print(array.shape)

The shape attribute of a NumPy array is a tuple that contains the dimensions of the array. In this case, the shape of the array is (5,), which means that the array has one dimension and contains five elements.
You can also access individual elements of a NumPy array using indexing. For example, the following code prints the second element of the integer array created earlier:

print(array[1])

This will print the value 2.
NumPy also supports a variety of operations, including arithmetic, sorting, filtering, and statistical functions. For example, the following code adds the number 10 to each element of the array:

array += 10

This will modify the array so that it contains the values 11, 12, 13, 14, 15.
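Putting these pieces together, a short runnable sketch of arithmetic, filtering, and statistics on the integer array:

```python
import numpy as np

array = np.array([1, 2, 3, 4, 5])

# Element-wise arithmetic
array += 10
print(array)               # [11 12 13 14 15]

# Boolean filtering keeps only elements matching a condition
print(array[array > 12])   # [13 14 15]

# Built-in statistical methods
print(array.mean())        # 13.0
print(array.sum())         # 65
```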
Conclusion
NumPy is a powerful and versatile library that is essential for anyone who works with numerical data. It provides a variety of functions and data structures that make it easy to work with arrays and matrices. If you are working with data science, machine learning, or any other field that involves numerical data, I encourage you to learn more about NumPy.
Fill in the blank: In NumPy, _____ enables operations to be performed on multiple components of a data object at the same time.
vectorization
In NumPy, vectorization enables operations to be performed on multiple components of a data object at the same time. Data professionals often work with large datasets, and vectorized code helps them efficiently compute large quantities of data.
You’ve learned that
part of what makes Python such a powerful and dynamic
language are the packages and libraries that are available to it. One of the most widely used and
important of these is NumPy. Recall that NumPy contains
multidimensional array and matrix data structures, as well as functions to manipulate them. You’ll learn more about
these data structures and functions soon, but for
now, let’s just consider NumPy at a high level and learn more about it. NumPy’s power comes from vectorization. Vectorization enables
operations to be performed on multiple components of a
data object at the same time. This is particularly useful
for data professionals because their jobs often involve working with very large quantities of data, and vectorized code saves a lot of time because it computes more efficiently. Let’s explore this a little more. Suppose we have list A and
list B, both the same length, and we want to create a new list C that’s the element-wise
product of each list. In other words, we want to
multiply the first element of list A by the first element of list B, then multiply the second element of list A by the second element
of list B, et cetera. If we try to multiply
lists A and B together, the computer throws an error. To perform this operation,
we could write a for loop. We’d start by defining
an empty list, list C, then make a range of indices to loop over, and append the product
of list A and list B at each index to list C. This gets the job done,
but if you’re thinking, “There’s gotta be an
easier way,” you’re right! We can use NumPy to perform this operation as a vectorized computation. Simply convert each list to a NumPy array and multiply the two arrays together using the product operator. The results are the same, but the vectorized approach
is simpler, easier to read, and faster to execute because, whereas loops iterate over
one element at a time, vector operations compute simultaneously in a single statement. The efficiency of this
might not be noticeable now, but when working with
large datasets it will be. Vectors also take up less memory space, which is another factor
that becomes important when working with a lot of data. You might have noticed
that when we used NumPy, we first had to import it. This is called an import statement. An import statement
uses the import keyword to load an external
library, package, module, or function into your
computing environment. Once you import something
into your notebook and run that cell, you don’t
need to import it again unless you restart your notebook. When we import NumPy, we import it as np. This is known as aliasing. Aliasing lets you assign an
alternate name, or alias, by which you can refer to something. In this case, we're
abbreviating NumPy to np. Notice the np's in the code
below the import statement where we create our arrays. If we didn't give NumPy an alias of np, we'd have to type out "numpy"
here in order to access its array function. Aliasing as np makes the code
shorter and easier to read. Note that np is the standard alias. If you use something else, other people might get confused
when reading your code. In addition to being highly
useful in its own right, NumPy powers a lot of
other Python libraries, like pandas, so understanding
how NumPy works will help you use these other packages. There’s a lot more to
discover about NumPy. Coming up, you’ll learn about
its core data structures and functionalities. See you soon.
The Power of Packages in Python
Python is a versatile and powerful programming language that has gained immense popularity in recent years. One of the key factors contributing to Python’s success is its extensive ecosystem of packages. Packages are collections of pre-written code that provide a wide range of functionality, making it easier and more efficient to develop Python applications.
What are Packages?
In Python, a package is a collection of Python modules that are organized together under a single namespace. Packages provide a structured way to manage and share code, making it easier to reuse and extend existing code. They also encapsulate specialized functionality into modular units, promoting code reusability and maintainability.
Why Use Packages?
There are several compelling reasons to use packages in Python development:
- Code Reusability: Packages eliminate the need to re-invent the wheel, allowing developers to leverage existing code written by others. This saves time and effort, enabling developers to focus on the core logic of their applications rather than spending time on basic tasks.
- Modular Development: Packages promote modular development, breaking down complex applications into smaller, manageable modules. This modular approach enhances code organization, making it easier to understand, maintain, and debug.
- Specialization: Packages encapsulate specialized functionality into modular units, allowing developers to access and utilize specific features without having to understand the underlying implementation details. This promotes code reusability and reduces the learning curve for developers.
- Community Support: Many popular packages have active communities of developers who contribute bug fixes, updates, and new features. This ensures long-term maintenance and support for the package.
- Standardization: Packages promote standardization by providing consistent and well-documented code structures. This makes it easier for developers to collaborate and understand each other’s code.
How to Install and Use Packages
Installing and using packages in Python is straightforward. The pip tool, which is included with Python, is used to manage package installation. To install a package, simply run the following command in your terminal:
pip install <package-name>
Once a package is installed, you can import its modules into your Python scripts using the import statement:

import <package-name>

This will make the package's modules available for use in your script. For example, to use the pandas package for data manipulation, you would import it as follows:

import pandas as pd
Popular Python Packages
The Python ecosystem boasts a vast collection of packages catering to various domains and applications. Here are some of the most popular and widely used packages:
- NumPy: A fundamental library for scientific computing, providing efficient numerical operations on arrays and matrices.
- pandas: A powerful data analysis library built on NumPy, providing tools for data manipulation, cleaning, and analysis.
- Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations.
- Seaborn: A data visualization library built on Matplotlib, providing a higher-level interface for creating common plots and graphs.
- scikit-learn: A machine learning library providing a wide range of algorithms and tools for supervised and unsupervised learning.
- requests: An HTTP library for making web requests and interacting with APIs.
- BeautifulSoup: A library for parsing and extracting data from HTML documents.
- Flask: A web framework for building web applications and APIs.
- Django: A powerful web framework for building complex web applications with a robust architecture.
Conclusion
Packages play a pivotal role in Python development, providing a wealth of pre-written code and functionality. By leveraging packages, developers can streamline the development process, enhance code reusability, and access specialized tools for various tasks. Whether you’re building data analysis applications, web applications, or machine learning models, packages are essential for efficient and effective Python development.
Python has many advanced
calculation capabilities that are used for data work and other scientific applications. These features make it possible to extend, enhance, and reuse parts of the code. To access these features, you can import them from libraries, packages, and modules. These features are not
included in basic Python, so it’s necessary to add
them to your scripts. The additional functionality
can save you time constructing functions and
objects in your own work. Using these features can also help you obtain extra data types for analyzing data or building machine learning models. Let’s start with libraries. A library, or package, broadly refers to a reusable collection of code. It also contains related
modules and documentation. You’ll often encounter the terms library and package used interchangeably. Commonly used libraries for data work are matplotlib and seaborn. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Seaborn is a data visualization library that’s based on matplotlib. It provides a simpler interface for working with common plots and graphs. This certificate program integrates two other commonly used
libraries for data work: NumPy and pandas. NumPy, or Numerical Python, is an essential library that contains multidimensional array
and matrix data structures and functions to manipulate them. This library is used for
scientific computation. And pandas, which stands
for Python Data Analysis, is a powerful library
built on top of NumPy that’s used to manipulate
and analyze tabular data. There are many other
popular Python libraries and packages for data professional work, such as scikit-learn,
statsmodels, and others. Scikit-learn, a library, and statsmodels, a package, consist of
functions data professionals can use to test the performance of statistical models. They’re used across
various scientific fields. Scikit-learn and statsmodels
are pretty advanced, so you won’t be working
with them in this course, but you’ll have opportunities to work with these libraries
elsewhere in the program. Again, different
practitioners across the field often conflate libraries and packages, so you may hear them referred to one, the other, or both ways. Libraries and packages
provide sets of modules that are essential for data professionals. Modules are accessed from
within a package or a library. They are Python files
that contain collections of functions and global variables. Global variables differ
from other variables because these variables can be accessed from anywhere in a program or script. Modules are used to organize functions, classes, and other data
in a structured way. Internally, modules are set up through separate files that contain these necessary classes and functions. When you import a module, you are using pre-written code components. Each module is an executable file that can be added to your scripts. Commonly used modules for
data professional work are math and random. Math provides access to
mathematical functions. And random is used to
generate random numbers. This is useful when selecting random elements from a list; shuffling elements randomly; or working with random sampling, which you’ll explore in a later course. There are several ways to import modules, depending on whether you want to use the whole package or just a single, pre-defined function or feature. This adds functionality for carrying out specialized operations. There’s lots more to
learn about libraries, packages, and modules, so feel free to refer
to the course resources for more information on installing these features and to continue growing your Python knowledge. But as a reminder, you don’t
have to install anything, because everything you need to complete the different sections of this certificate program are already built into the notebooks you’ll be using in Coursera. I’ll introduce you to some
libraries in the next video.
Reading: Understand Python libraries, packages, and modules
Reading
Recently, you learned about Python libraries, packages, and modules. As you’ve discovered, importing these tools saves data professionals time and enhances their programming. Another benefit of commonly used libraries is that they are constantly scrutinized and updated by talented and knowledgeable programmers. Thus, you can be confident that the underlying code is high quality.
In this reading, you’ll learn more about the basic features of libraries, packages, and modules; how they are related; and a selection of basic modules you might use as a data professional.
Libraries, packages, and modules
A library is a corpus of reusable code modules and their accompanying documentation. Libraries are bundled into packages that you install, which can then be imported into your coding environment as needed. You’ll typically encounter the terms “library” and “package” used interchangeably. Generally, this certificate program will refer to both as libraries, but it’s important to be acquainted with both terms.
Modules are similar to libraries, in that they are groups of related classes and functions, but they are generally subcomponents of libraries. In other words, a library can have many different modules, and you can choose to import the entire library or just the module you need.
Import statements
Libraries and modules beyond the Python standard library must be installed first and then imported into your working environment on an as-needed basis.
To import a library or module, use an import statement. Import statements require particular syntax using the import keyword. Here are some examples:
Note: The following code block is not interactive.
import numpy
This import statement imports the NumPy library into your working environment. After running this command, you’ll have access to all NumPy classes and functions. For instance, to use the array() function on [2, 4, 6], you’d write:
Note: The following code block is not interactive.
numpy.array([2, 4, 6])
Notice that to access the array() function, you must precede it with numpy, because this indicates that the function is coming from the NumPy library.
Aliasing
Another time-saver with Python libraries is aliasing. Aliasing helps you avoid typing a library’s full name every time you want to access one of its functions. Instead, you’ll assign the library an alias. An alias is an abbreviated name, which is designated using the as keyword:
Note: The following code block is not interactive.
import numpy as np
In this case, the NumPy library is imported with the np alias. You can assign any abbreviation you like as an alias, but commonly used libraries have common aliases. Therefore, straying from those could cause confusion when sharing code with others. Here are some common libraries and their conventional aliases used by data professionals:
Note: The following code block is not interactive.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
NumPy is used for high-performance vector and matrix computations. Pandas is a library for manipulating and analyzing tabular data. Seaborn and matplotlib are both libraries used to create graphs, charts, and other data visualizations.
After running these imports, whenever you want to use a function from one of these libraries, precede the function with the alias. Returning to the example with NumPy’s array() function, after aliasing, you’d write:
Note: The following code block is not interactive.
np.array([2, 4, 6])
Additional import syntax
Importing modules
Recall from the previous example:
Note: The following code block is not interactive.
import matplotlib.pyplot as plt
You may have noticed that this syntax differs slightly from the other examples. In this case, matplotlib is the library and pyplot is a module inside it. The pyplot module is aliased as plt, and it’s accessed from the matplotlib library using dot notation.
Importing functions
Just as you can import libraries and modules, you can also import individual functions from libraries or from modules within libraries using a specific syntax. Here’s an example depicting a common import when using the scikit-learn library to build machine learning models:
Note: The following code block is not interactive.
from sklearn.metrics import precision_score, recall_score
Again, notice the different syntax. The import statement begins with the from keyword, followed by sklearn.metrics—the scikit-learn library + the metrics module. Next is the import keyword followed by the desired functions. In this case, there are two: precision_score and recall_score.
The same syntax can be applied to the example using NumPy’s array() function. However, note that you typically would not encounter individual functions being imported from NumPy. It’s much easier and more common to just import the whole library.
Note: The following code block is not interactive.
from numpy import array
When a function is imported by name, like in this example, you can use it without any preceding syntax to indicate the library or module that it comes from:
Note: The following code block is not interactive.
array([2, 4, 6])
Discouraged syntax
One last syntactical variation that you might encounter is:
Note: The following code block is not interactive.
from library.module import *
This imports everything from a particular library or module and allows you to use its functions without any preceding syntax. So, for instance, if you wrote from numpy import *, you’d be able to use all of NumPy’s functions without preceding them with numpy or np. This approach is not recommended because it makes it difficult to track where functions come from. However, it’s helpful to be aware of this because you will likely encounter it in your work as a data professional. And, in specific instances, it might be useful.
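To see why star imports are discouraged, here is a small sketch (assuming NumPy is installed) in which a second star import silently shadows a name from the first:

```python
# Both math and NumPy define a function named log.
from math import *
from numpy import *

# 'log' now refers to numpy.log -- the second import silently
# replaced math.log, and nothing at the call site reveals which one wins.
print(log(1))        # 0.0, but from which module?
print(type(log(1)))  # the return type reveals it came from NumPy
```

Using `import numpy as np` and calling `np.log(1)` keeps the origin of every function explicit.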
Commonly used built-in modules
The Python standard library comes with a number of built-in modules relevant to data professional work such as math, datetime, and random. These can be imported without additional installation. In other words, you can import them directly, as long as you have Python installed. For example:
- datetime: Provides many helpful date and time conversions and calculations
import datetime
date = datetime.date(1977, 5, 8) # assign a date to a variable
print(date) # print date
print(date.year) # print the year that the date is in
delta = datetime.timedelta(days=30) # assign a timedelta of 30 days to a variable
print(date - delta) # print date of 30 days prior
1977-05-08
1977
1977-04-08
- math: Provides access to mathematical functions
import math
print(math.exp(0)) # e**0
print(math.log(1)) # ln(1)
print(math.factorial(4)) # 4!
print(math.sqrt(100)) # square root of 100
1.0
0.0
24
10.0
- random: Useful for generating random numbers
import random
print(random.random()) # 0.0 <= X < 1.0
print(random.choice([1, 2, 3])) # choose a random element from a sequence
print(random.randint(1, 10)) # a <= X <= b
0.28461065272727804
3
1
Key takeaways
Libraries, packages, and modules are gateways to Python’s countless capabilities. Understanding how to leverage them for your own coding needs will unlock new tools and functions that make your work much more efficient. Check out the Python Package Index at the PyPI repository to search for useful libraries. There are packages designed for applications as diverse as chemistry, audio editing, natural language processing, and video games. Whatever it is you’re trying to do, chances are someone has developed a suite of tools to help you!
Video: Basic array operations
In this tutorial, the speaker introduces the NumPy library and its significance in making data manipulation faster and more efficient through vectorization. The core data structure of NumPy, known as an “n-dimensional array” or ndarray, is highlighted. The tutorial covers the creation of ndarrays from Python objects, emphasizing their mutability and the ability to change values.
The importance of maintaining a consistent data type within an array is discussed, and the consequences of mixing data types are demonstrated. The tutorial also covers the use of the dtype attribute to check the data type of array contents. The concept of multidimensional arrays is introduced, including one-dimensional, two-dimensional, and three-dimensional arrays.
The shape and ndim attributes are explained as tools to confirm the structure and number of dimensions of an array. The tutorial touches on reshaping arrays using the reshape method, emphasizing its relevance in data analysis tasks. The speaker briefly mentions the vast capabilities of NumPy, including mathematical and statistical operations, and the importance of understanding NumPy basics for working with libraries like Pandas in data analysis.
The tutorial concludes by highlighting NumPy’s integral role in advanced data analysis and its frequent use in conjunction with other libraries and packages.
Basic Array Operations in Python
Arrays are a fundamental data structure for data work in Python, used to store and manipulate collections of values. They offer an efficient way to represent and work with large datasets compared to individual variables. This tutorial introduces basic array operations in Python, using both built-in lists and NumPy arrays, equipping you to handle data with ease.
1. Creating Arrays:
There are several ways to create arrays in Python:
- Using a list:
Python
my_list = [1, 2, 3, 4, 5]
print(my_list) # Output: [1, 2, 3, 4, 5]
- Using numpy.array:
Python
import numpy as np
my_array = np.array([1, 2, 3, 4, 5])
print(my_array) # Output: [1 2 3 4 5]
- Using built-in functions:
Python
my_range = range(1, 6)
print(my_range) # Output: range(1, 6)
my_zeros = np.zeros(5)
print(my_zeros) # Output: [0. 0. 0. 0. 0.]
2. Accessing Elements:
Elements in an array can be accessed using their index (position) within square brackets:
Python
print(my_list[2]) # Output: 3
# Negative indexing starts from the end
print(my_list[-1]) # Output: 5
# Accessing sub-arrays
print(my_list[1:3]) # Output: [2, 3]
3. Modifying Arrays:
You can modify existing elements or add new ones using assignment:
Python
my_list[0] = 10
print(my_list) # Output: [10, 2, 3, 4, 5]
my_list.append(6)
print(my_list) # Output: [10, 2, 3, 4, 5, 6]
# Note: Python lists can grow and shrink freely; NumPy arrays have a
# fixed size, so resizing one requires reassignment
4. Array Operations:
Python provides various operators for working with arrays:
- Arithmetic operations: Addition, subtraction, multiplication, division (element-wise)
- Comparison operators: Less than, greater than, equal to, etc. (element-wise)
- Logical operators: AND, OR, NOT (element-wise)
Python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # Output: [5 7 9]
print(a > b) # Output: [False False False]
print(np.all(a > b)) # Output: False
5. Useful Functions:
- len(array): Returns the number of elements.
- np.sum(array): Calculates the sum of all elements.
- np.mean(array): Computes the average value.
- np.sort(array): Sorts the elements in ascending order.
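A quick sketch of these functions in action (the array values are arbitrary):

```python
import numpy as np

arr = np.array([3, 1, 2])
print(len(arr))     # 3
print(np.sum(arr))  # 6
print(np.mean(arr)) # 2.0
print(np.sort(arr)) # [1 2 3]
```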
These are just a few basic operations. As you learn more about Python, you’ll discover more advanced features of libraries like NumPy that offer extensive array manipulation capabilities.
Remember, practice is key! Start exploring and experimenting with these operations on your own data sets. The more you work with arrays, the more comfortable and efficient you’ll become in handling data analysis tasks.
I hope this tutorial provides a solid foundation for your journey into the world of arrays in Python! Feel free to ask if you have any questions.
Welcome back. You’ve recently learned that the NumPy library uses vectorization to make working with data
faster and more efficient, which makes your job easier. I demonstrated how NumPy performed an element-wise
multiplication of two lists by converting the lists to arrays and then simply multiplying them together. Now, we’re going to continue
learning about arrays and how to work with them. The array is the core
data structure of NumPy. The data object itself is known as an “n-dimensional
array,” or ndarray for short. The ndarray is a vector. Recall that vectors enable many operations to be performed together
when the code is executed, resulting in faster run-times that require less computer memory. You can create an ndarray
from a Python object by passing the object to
the np.array function. ndarrays are mutable, so you can change the values they contain. If I want to change this last
value from a four to a five, I can do that by identifying
the index number. Since we’re dealing with
the last value, here, it’s necessary to use a negative one. But, I can’t change the size of the array without reassigning it. If I try to add a number
to the end of this array, the computer throws an error. So if you want to change
the size of an array, you have to reassign it. Another requirement of the array is that all of its elements
be of the same data type. If I create an array with
the integers one, two, and then a string of “coconut,” NumPy will create an array that forces everything to the
same data type, if possible. In this case, everything becomes a string, represented here by
“U21,” meaning a Unicode string of up to 21 characters. So be careful when creating your arrays that they all contain
data of the same type, or, if they don’t, that
this is intentional and useful to what you’re doing. You previously learned that calling the “type”
function on an object will return the data type of the object. If we do that with an array, as you might expect, we get a NumPy array. We can use the dtype attribute if we want to check the data type of the contents of an array. Here, the dtype attribute indicates that this array consists of integers. As the name implies, ndarrays
can be multidimensional. For a one-dimensional array, NumPy takes an array-like
object of length X, like a list, and creates an ndarray in the shape of X. A one-dimensional array is
neither a row nor a column. We can use the shape attribute to confirm the shape of an array. We can also use the ndim attribute to confirm the number of
dimensions the array has. Data professionals will
often need to confirm the shape and number of
dimensions of their array. For example, if they’re
trying to attach it to another existing array. These methods are also commonly used to help understand what’s going wrong when your code throws an error. A two-dimensional array can be
created from a list of lists, where each internal
list is the same length. You can think of these internal
lists as individual rows, so the final array is like a table. Notice that this array has a shape of four rows by two columns and is two dimensions. If a two-dimensional
array is a list of lists, then a three-dimensional array is a list that contains two of these, so a list of two lists of lists. This array can be
thought of as two tables, each with two rows and three columns. Thus, it has three dimensions. This can go on indefinitely. Thankfully, there are ways to help simplify working
with multidimensional arrays, which you’ll learn about later. And unless you’re doing very advanced scientific computations, you typically won’t be working
directly with NumPy arrays that are more than three dimensions. NumPy also lets us reshape an array using the reshape method. Our two-dimensional array was
four rows and two columns. But what if we wanted this data to be two rows by four columns? We just plug these values
into the reshape method and reassign the result back
to the array 2D variable. Reshaping data is a common
task in data analysis, so it’s important for you to be familiar with what it means and how it works. There are many other operations that can be performed with arrays, and you’ll surely learn those as the need for them
arises in your projects. But there are other helpful
functions and methods in NumPy that you’ll use regularly. These include functions to
calculate things like mean, and natural logarithm, and floor and ceiling operations, which round numbers to the nearest lesser and greater whole number, respectively. And many other frequently used mathematical and statistical operations. NumPy is very robust. There are so many things
you can do with it that we can only briefly
consider them here. As you know, NumPy powers many other useful libraries and packages. In this certificate program, we won’t be working a
lot directly with NumPy, but we will be working
a lot with a library that depends on it: Pandas. It’s important that you
understand the basics of NumPy because it will help you when
you start working with Pandas. As you develop your skills
as a data professional, you’ll find yourself returning
to NumPy time and time again because it’s such an integral part of advanced data analysis. Bye for now.
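The behaviors described in this video — type coercion, the shape and ndim attributes, and reshaping — can be sketched as follows (the values are illustrative):

```python
import numpy as np

# Mixing data types forces everything to a common type (here, strings)
mixed = np.array([1, 2, 'coconut'])
print(mixed.dtype)     # a Unicode string dtype such as <U21

# Shape, dimensions, and reshaping
array_2d = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print(array_2d.shape)  # (4, 2): four rows, two columns
print(array_2d.ndim)   # 2

# Reshape to two rows by four columns and reassign
array_2d = array_2d.reshape(2, 4)
print(array_2d.shape)  # (2, 4)
```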
Reading: Reference guide: Arrays
Reading
As you’ve learned, NumPy is a powerful library capable of performing advanced numerical computing. One of its main benefits is the ability to work with arrays, as an operation applied to a vector executes much faster than the same operation applied to a list. Performance increases become further apparent when working with large volumes of data. This reading is a reference guide for working with NumPy arrays.
Create an array
As you’ve discovered, to use NumPy, you must first import it. Standard practice is to alias it as np.
Calling np.array() creates an ndarray (n-dimensional array). There is no limit to how many dimensions a NumPy array can have, but arrays with many dimensions can be more difficult to work with.
1-D array:
import numpy as np
array_1d = np.array([1, 2, 3])
array_1d
[1 2 3]
Notice that a one-dimensional array is similar to a list.
2-D array:
array_2d = np.array([(1, 2, 3), (4, 5, 6)])
array_2d
[[1 2 3]
[4 5 6]]
Notice that a two-dimensional array is similar to a table.
3-D array:
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
array_3d
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
Notice that a three-dimensional array is similar to two tables.
- np.zeros(): This creates an array of a designated shape that is pre-filled with zeros:
np.zeros((3, 2))
[[0. 0.]
 [0. 0.]
 [0. 0.]]
- np.ones(): This creates an array of a designated shape that is pre-filled with ones:
np.ones((2, 2))
[[1. 1.]
 [1. 1.]]
- np.full(): And this creates an array of a designated shape that is pre-filled with a specified value:
np.full((5, 3), 8)
[[8 8 8]
 [8 8 8]
 [8 8 8]
 [8 8 8]
 [8 8 8]]
These functions are useful for various situations:
- To initialize an array of a specific size and shape, then fill it with values derived from a calculation
- To allocate memory for later use
- To perform matrix operations
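For instance, the first use case — initializing an array and then filling it with calculated values — might look like this sketch:

```python
import numpy as np

# Pre-fill with zeros, then overwrite each slot with a computed value
squares = np.zeros(5)
for i in range(5):
    squares[i] = i ** 2

print(squares)  # [ 0.  1.  4.  9. 16.]
```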
Array methods
NumPy arrays have many methods that allow you to manipulate and operate on them. For a full list, refer to the NumPy array documentation. Some of the most commonly used methods follow:
- ndarray.flatten(): This returns a copy of the array collapsed into one dimension.
array_2d = np.array([(1, 2, 3), (4, 5, 6)])
print(array_2d)
print()
array_2d.flatten()
[[1 2 3]
[4 5 6]]
[1 2 3 4 5 6]
- ndarray.reshape(): This gives a new shape to an array without changing its data.
array_2d = np.array([(1, 2, 3), (4, 5, 6)])
print(array_2d)
print()
array_2d.reshape(3, 2)
[[1 2 3]
[4 5 6]]
[[1 2]
[3 4]
[5 6]]
Passing -1 for one of the dimensions in the designated new shape tells NumPy to infer that dimension automatically from the array’s size and the other given values.
array_2d = np.array([(1, 2, 3), (4, 5, 6)])
print(array_2d)
print()
array_2d.reshape(3, -1)
[[1 2 3]
[4 5 6]]
[[1 2]
[3 4]
[5 6]]
- ndarray.tolist(): This converts an array to a list object. Multidimensional arrays are converted to nested lists.
array_2d = np.array([(1, 2, 3), (4, 5, 6)])
print(array_2d)
print()
array_2d.tolist()
[[1 2 3]
[4 5 6]]
[[1, 2, 3], [4, 5, 6]]
Mathematical functions
NumPy arrays also have many methods that are mathematical functions:
- ndarray.max(): returns the maximum value in the array or along a specified axis.
- ndarray.mean(): returns the mean of all the values in the array or along a specified axis.
- ndarray.min(): returns the minimum value in the array or along a specified axis.
- ndarray.std(): returns the standard deviation of all the values in the array or along a specified axis.
a = np.array([(1, 2, 3), (4, 5, 6)])
print(a)
print()
print(a.max())
print(a.mean())
print(a.min())
print(a.std())
[[1 2 3]
[4 5 6]]
6
3.5
1
1.70782512766
Array attributes
NumPy arrays have several attributes that enable you to access information about the array. Some of the most commonly used attributes include the following:
- ndarray.shape: returns a tuple of the array’s dimensions.
- ndarray.dtype: returns the data type of the array’s contents.
- ndarray.size: returns the total number of elements in the array.
- ndarray.T: returns the array transposed (rows become columns, columns become rows).
array_2d = np.array([(1, 2, 3), (4, 5, 6)])
print(array_2d)
print()
print(array_2d.shape)
print(array_2d.dtype)
print(array_2d.size)
print(array_2d.T)
[[1 2 3]
[4 5 6]]
(2, 3)
int64
6
[[1 4]
[2 5]
[3 6]]
Indexing and slicing
Access individual elements of a NumPy array using indexing and slicing. Indexing in NumPy is similar to indexing in Python lists, except multiple indices can be used to access elements in multidimensional arrays.
a = np.array([(1, 2, 3), (4, 5, 6)])
print(a)
print()
print(a[1])
print(a[0, 1])
print(a[1, 2])
[[1 2 3]
[4 5 6]]
[4 5 6]
2
6
Slicing may also be used to access subarrays of a NumPy array:
a = np.array([(1, 2, 3), (4, 5, 6)])
print(a)
print()
a[:, 1:]
[[1 2 3]
[4 5 6]]
[[2 3]
[5 6]]
Array operations
NumPy arrays support many operations, including mathematical functions and arithmetic. These include array addition and multiplication, which performs element-wise arithmetic on arrays:
a = np.array([(1, 2, 3), (4, 5, 6)])
b = np.array([[1, 2, 3], [1, 2, 3]])
print('a:')
print(a)
print()
print('b:')
print(b)
print()
print('a + b:')
print(a + b)
print()
print('a * b:')
print(a * b)
a:
[[1 2 3]
[4 5 6]]
b:
[[1 2 3]
[1 2 3]]
a + b:
[[2 4 6]
[5 7 9]]
a * b:
[[ 1 4 9]
[ 4 10 18]]
In addition, there are nearly 100 other useful mathematical functions that can be applied to individual or multiple arrays.
Mutability
NumPy arrays are mutable, but with certain limitations. For instance, an existing element of an array can be changed:
a = np.array([(1, 2), (3, 4)])
print(a)
print()
a[1][1] = 100
a
[[1 2]
[3 4]]
[[ 1 2]
[ 3 100]]
However, the array cannot be lengthened or shortened:
a = np.array([1, 2, 3])
print(a)
print()
a[3] = 100
a
Error on line 5:
a[3] = 100
IndexError: index 3 is out of bounds for axis 0 with size 3
How NumPy arrays store data in memory
NumPy arrays work by allocating a contiguous block of memory at the time of instantiation. Most other structures in Python don’t do this; their data is scattered across the system’s memory. This is what makes NumPy arrays so fast; all the data is stored together at a particular address in the system’s memory.
Interestingly, this is also what prevents an array from being lengthened or shortened: The abutting memory is occupied by other information. There’s no room for more data at that memory address. However, existing elements of the array can be replaced with new elements.
The only way to lengthen an array is to copy the existing array to a new memory address along with the new data.
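A short sketch of this behavior: NumPy’s np.append function does not grow the original array in place; it copies the data to a new array:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)  # copies a's data plus the new value to fresh memory

print(a)  # [1 2 3]   -- the original array is unchanged
print(b)  # [1 2 3 4] -- a brand-new, longer array
```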
Exemplar: Arrays and vectors with NumPy
Practice Quiz: Test your knowledge: Arrays and vectors with NumPy
Python libraries and packages include which of the following features? Select all that apply.
Modules, Reusable collections of code, Documentation
A Python library, or package, broadly refers to a reusable collection of code. Libraries and packages also contain related modules and documentation. You’ll often encounter the terms library and package used interchangeably.
What is the core data structure of NumPy?
Array
The array is the core data structure of NumPy. The data object itself is known as an n-dimensional array, or ndarray for short. An array can be multidimensional, and all its elements must be of the same data type.
A data professional wants to confirm the datatype of the contents of array x. How would they do this?
x.dtype
dtype is a NumPy array attribute used to check the data type of the contents of an array.
Dataframes with pandas
Video: Introduction to pandas
Main Points:
- Pandas is a powerful Python library for data manipulation and analysis.
- It provides a convenient interface for working with tabular data.
- Pandas allows for easy data loading from various formats (CSV, Excel, etc.).
- Key functionalities include:
- Data manipulation and filtering.
- Statistical analysis (mean, min, max, standard deviation, etc.).
- Data grouping and aggregations.
- Custom calculations.
- Pandas simplifies data analysis tasks and provides intuitive visualization.
Key Takeaways:
- Pandas is an essential tool for data professionals and data analysis tasks.
- Its user-friendly interface makes it easier to work with data than NumPy.
- Pandas offers a wide range of functionalities for efficient data manipulation and analysis.
- Learning Pandas opens doors to powerful data exploration and insights.
Additional Notes:
- Dataframe is the core data structure in Pandas, representing tables with rows and columns.
- Pandas supports various data types like integers, floats, strings, and booleans.
- Filtering allows focusing on specific subsets of data based on simple or complex logic.
- Pandas enables manipulation of data, such as adding calculated columns.
- Grouping and aggregation facilitate analysis of data subsets based on specific criteria.
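The functionalities listed above can be sketched with a small, hypothetical dataframe (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'apple', 'pear'],
    'price': [1.00, 1.50, 2.00],
})

# Filtering with a boolean condition
apples = df[df['fruit'] == 'apple']

# Adding a calculated column
df['price_with_tax'] = df['price'] * 1.1

# Grouping and aggregating
mean_price = df.groupby('fruit')['price'].mean()
print(mean_price)
```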
What is Pandas?
Pandas is a powerful and popular Python library for data analysis and manipulation. It provides efficient data structures and functions for working with tabular data, commonly known as spreadsheets or tables with rows and columns. Think of it as a toolbox specifically designed to handle and analyze your data with ease.
Why Use Pandas?
There are several reasons why Pandas is a preferred choice for data analysis in Python:
- Simplicity: Pandas offers a user-friendly interface that makes it easier to comprehend and work with data compared to lower-level libraries like NumPy.
- Efficiency: Pandas provides optimized data structures and algorithms for efficient data handling, leading to faster processing and analysis.
- Functionality: Pandas is packed with various functionalities for data manipulation, analysis, and visualization.
- Versatility: Pandas supports loading data from various formats like CSV, Excel, databases, and more.
- Popularity: Pandas is a widely used and well-supported library, providing a vast community for support and learning resources.
Getting Started with Pandas
1. Installation:
Ensure you have Python installed and run the following command to install Pandas:
Bash
pip install pandas
2. Import Pandas:
In your Python script, import Pandas using the following syntax:
Python
import pandas as pd
3. Loading Data:
Pandas offers various functions for loading data from different sources. Here are some common examples:
Python
# Load data from a CSV file
data = pd.read_csv("data.csv")
# Load data from an Excel spreadsheet
data = pd.read_excel("data.xlsx")
# Load data from a URL (read_csv accepts URLs directly)
data = pd.read_csv("https://example.com/data.csv")
4. Exploring Data:
Once you have loaded your data into a Pandas dataframe, you can explore it using various methods:
- Accessing specific data: Use indexing and slicing to access rows, columns, or individual cells.
- Descriptive statistics: Calculate summary statistics like mean, standard deviation, minimum, and maximum values.
- Data filtering: Select subsets of data based on specific conditions and criteria.
- Data sorting: Sort data based on specific columns in ascending or descending order.
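A brief sketch of these exploration methods on a hypothetical dataframe (the column name and values are invented):

```python
import pandas as pd

data = pd.DataFrame({'score': [88, 72, 95]})

print(data.head(2))               # first two rows
print(data['score'].mean())       # 85.0
print(data.describe())            # summary statistics for each column
print(data.sort_values('score'))  # rows ordered by score, ascending
```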
5. Data Manipulation:
Pandas allows you to modify and manipulate your data:
- Adding and removing columns: Add new columns based on calculations or logic, or remove unnecessary columns.
- Editing data: Modify existing values in your dataframe.
- Merging and joining dataframes: Combine data from multiple sources.
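These manipulation steps can be sketched with two small illustrative dataframes that share an "item" key:

```python
import pandas as pd

# Two made-up dataframes sharing an "item" key
prices = pd.DataFrame({"item": ["pen", "book"], "price": [2.0, 10.0]})
stock = pd.DataFrame({"item": ["pen", "book"], "qty": [100, 25]})

prices["price_with_tax"] = prices["price"] * 1.2  # add a derived column
prices.loc[0, "price"] = 2.5                      # edit an existing value
merged = pd.merge(prices, stock, on="item")       # join the two on the shared key
print(merged)
```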
6. Data Visualization:
Pandas offers built-in functions for visualizing your data:
- Series plots: Create line plots, bar charts, etc., for single data series.
- DataFrame plots: Generate scatter plots, heatmaps, boxplots, and more.
Learning Resources:
Here are some resources to learn more about Pandas:
- Official Pandas Documentation: https://pandas.pydata.org/docs/
- Pandas Tutorial: https://www.datacamp.com/courses/intro-to-python-for-data-science
- Pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- Interactive Pandas Tutorial: https://www.tutorialspoint.com/python_pandas/index.htm
Next Steps:
This tutorial provides a brief introduction to Pandas. As you progress, you’ll delve deeper into its functionalities, explore advanced data analysis techniques, and master the art of data manipulation and visualization with Pandas.
Hi there. We previously discussed NumPy, and how it’s an important tool for data professionals and anyone else whose job requires high
performance computational power. We also investigated how other libraries and packages use NumPy because of the efficiencies
that come with vectorization. One of these libraries is pandas, a quintessential tool both
in this certificate program and in the world of data analytics. In this lesson, you’re going
to learn more about pandas and why it’s so useful. Because pandas is a library that adds functionality
to Python’s core toolset, you have to import it. Similar to how we imported NumPy as np, pandas has its own standard alias of pd. Typically, when using pandas, you import both NumPy and pandas together. This is just for convenience, given that NumPy is often used
in conjunction with pandas. Strictly speaking, you don’t have to import
NumPy to work in pandas. Pandas is fully operational on its own. Pandas’ key functionality is the manipulation and
analysis of tabular data – that is, data that’s in the form of a table, with rows and columns. A spreadsheet is a common
example of tabular data. While NumPy is capable of many of the same functions
and operations as pandas, it’s not always as easy to work with because it requires you to work more abstractly with the data and keep track of what’s being done to it, even if you can’t see it. Pandas, on the other hand, provides a simple
interface that allows you to display your data as rows and columns. This means that you can
always follow exactly what’s happening to your
data as you manipulate it. In this video, I’ll give you
a demonstration of pandas, and what it’s like to use it. Later, we’ll go into greater detail on its unique classes,
processes, and functions. First of all, you can load
data into pandas easily from different formats like comma-separated value files, or CSVs, Excel, and other spreadsheets,
databases, and more. Here, I’m loading a CSV file that I’m accessing via a web URL. The file contains information for some of the passengers
from the Titanic, including their names,
what class ticket they had, their age, ticket price, and cabin number. By the way, this table of
data is called a dataframe. The dataframe is a core
data structure in pandas. Notice that the dataframe is
made up of rows and columns, and it can contain data of
many different data types including integers, floats,
strings, booleans, and more. If I want to calculate the
average age of the passengers, I do so by selecting the age column and calling the mean method on it. I can also get the max, min, and standard deviation
with minimal effort. I can also quickly check how many passengers were in each class. Checking summary statistics
of the entire dataset only requires one line of code. This method gives me the number of rows as well as the mean, standard deviation, minimum and maximum values, along with the quartiles
for every numeric column. These concepts are all covered in greater depth elsewhere in the program. For now, I just want you
to pay close attention to the power of pandas and all that you can accomplish with it. Pandas also allows me to filter based on simple or complex logic. For example, here I’m selecting only the third class passengers
who were older than 60. In addition to all of
these data analysis tools, pandas also gives us ways to
manipulate and change the data. For example, I can add
a column that represents the inflation adjusted price
of a ticket from 1912 to 2023. Florence Briggs Thayer paid 71.28 pounds for her first class ticket. Today, that ticket would have cost her 10,417 pounds sterling. If you’re wondering
how I knew her name was Florence Briggs Thayer, it’s because I can also
select rows, columns, or individual cells from
the data using indexing. Her name is in row one, column three. I can also do more complex data
groupings and aggregations. For example, here I’m grouping the
passengers by class and sex, and then calculating the mean cost of a ticket for each group. Hopefully you’re excited to
start working with pandas. I know I’m looking forward to
guiding you as you learn more about this powerful and
fun data analysis tool.
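The operations narrated in this video can be sketched with a few rows of made-up data. The column names follow the standard Titanic dataset, but the rows and the inflation factor are illustrative only (the factor is reverse-engineered from the figures quoted in the video):

```python
import pandas as pd

# A few made-up rows shaped like the Titanic dataset
titanic = pd.DataFrame({
    "Name": ["Thayer, Mrs. Florence Briggs", "Doe, Mr. John", "Roe, Ms. Jane"],
    "Pclass": [1, 3, 3],
    "Sex": ["female", "male", "female"],
    "Age": [38.0, 62.0, 24.0],
    "Fare": [71.28, 7.25, 8.05],
})

print(titanic["Age"].mean())                      # average passenger age
print(titanic["Pclass"].value_counts())           # passengers in each class
print(titanic.describe())                         # summary statistics in one line
print(titanic[(titanic["Pclass"] == 3) & (titanic["Age"] > 60)])  # filtered view
titanic["Fare_2023"] = titanic["Fare"] * 146.15   # illustrative inflation factor
print(titanic.groupby(["Pclass", "Sex"])["Fare"].mean())  # grouped aggregation
```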
Video: pandas basics
Main Points:
- Pandas is a powerful library for data analysis and manipulation.
- It has two main object classes: DataFrames and Series.
- DataFrames are two-dimensional, labeled data structures with rows and columns.
- Series are one-dimensional, labeled arrays.
- DataFrames can be created from various data sources like dictionaries, NumPy arrays, and CSV files.
- They have many useful methods and attributes for accessing and manipulating data.
- Selecting and referencing parts of a DataFrame is done using bracket notation or dot notation.
- iloc is used for integer-location-based selection, while loc is used for selection by label.
- New columns can be added to a DataFrame using simple assignment statements.
Key Takeaways:
- DataFrames are the primary data structure in pandas.
- Series are useful for representing individual columns or rows.
- Pandas offers various methods for creating and manipulating DataFrames.
- Indexing and slicing are used to select parts of a DataFrame.
- iloc and loc provide different ways for selecting based on location or label.
- DataFrames are flexible and powerful tools for data analysis.
Additional Notes:
- NaN represents null values in pandas.
- Series with mixed data types are classified as “object” type.
- Dot notation is preferred for simple selections, while brackets are better for complex code.
- The pandas documentation is a valuable resource for learning more about its features and capabilities.
Conclusion:
This tutorial provides a solid introduction to pandas DataFrames and Series. It covers basic creation, manipulation, and selection techniques. By understanding these key concepts, you can begin working with pandas effectively and explore its vast potential for data analysis. Remember, the documentation is always available to guide you as you advance your pandas skills.
Introduction:
Welcome to the exciting world of pandas! This tutorial will guide you through the essential concepts of pandas, equipping you with the foundational knowledge to work effectively with data in Python.
1. What is Pandas?
Pandas is a powerful and popular open-source library in Python specifically designed for data analysis and manipulation. It offers a wide range of functionalities, including:
- Data structures: Pandas provides efficient data structures like DataFrames and Series for organizing and managing tabular data.
- Data manipulation: Powerful methods and functionalities for cleaning, transforming, and analyzing data.
- Data analysis: Tools for statistical analysis, visualization, and exploration.
2. Core Data Structures:
- DataFrame: A two-dimensional, labeled data structure with rows and columns. Think of it as a spreadsheet or a SQL table.
- Series: A one-dimensional, labeled array. Often used to represent individual columns or rows of a DataFrame.
3. Creating DataFrames:
There are several ways to create DataFrames:
- From dictionaries: Keys become column names and values become list elements in each column.
- From NumPy arrays: Arrays are converted to DataFrames with rows and columns.
- From CSV files: Pandas provides the read_csv function to import data from CSV files.
- From other data sources: Pandas supports various data sources like JSON, Excel files, and databases.
4. Accessing Data:
- Column selection: Use bracket notation with column names or dot notation (for simple names).
- Row selection: Use integer-based indexing (iloc) or label-based indexing (loc).
- Accessing specific values: Use bracket notation with two indices (row and column).
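A short sketch of these selection patterns on a tiny illustrative dataframe:

```python
import pandas as pd

# Made-up data with named row labels
df = pd.DataFrame({"city": ["Oslo", "Lima"], "population": [0.7, 10.7]},
                  index=["a", "b"])

print(df["city"])           # column selection with brackets
print(df.city)              # dot notation (simple column names only)
print(df.iloc[0])           # row by integer position
print(df.loc["b"])          # row by label
print(df.loc["a", "city"])  # single value: row label, column label
```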
5. Basic Operations:
- Data cleaning: Handling missing values, formatting data types, and removing outliers.
- Data manipulation: Filtering, sorting, merging, and joining DataFrames.
- Data analysis: Descriptive statistics, group-by operations, and time series analysis.
6. Essential Methods:
- head(): Shows the first few rows of a DataFrame.
- tail(): Shows the last few rows of a DataFrame.
- info(): Displays information about the DataFrame, including data types and null values.
- describe(): Provides descriptive statistics for numerical columns.
- sort_values(): Sorts a DataFrame by a specific column.
- groupby(): Groups data by a specific column and performs operations on each group.
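The methods above in action on a small illustrative dataframe:

```python
import pandas as pd

# Made-up scores for two teams
df = pd.DataFrame({"team": ["A", "A", "B", "B"],
                   "score": [10, 12, 7, 9]})

print(df.head(2))                          # first two rows
print(df.tail(2))                          # last two rows
df.info()                                  # dtypes and non-null counts
print(df.describe())                       # summary stats for numeric columns
print(df.sort_values("score"))             # sort ascending by score
print(df.groupby("team")["score"].mean())  # mean score per team
```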
7. Visualization:
Pandas integrates seamlessly with matplotlib and seaborn for data visualization. You can create various charts and graphs to explore and understand your data.
8. Conclusion:
This tutorial serves as a starting point for your journey with pandas. As you practice and explore, you will discover its immense capabilities and become a proficient data analyst. Remember, the pandas documentation is your valuable resource for learning more about specific functionalities and advanced techniques.
Additional Resources:
- Official Pandas Documentation: https://pandas.pydata.org/docs/
- Pandas Tutorial: https://www.datacamp.com/tutorial/python
- Pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Next Steps:
- Practice the concepts covered in this tutorial with real data sets.
- Explore more advanced pandas functionalities like data aggregation, time series analysis, and plotting.
- Learn about other libraries like matplotlib and seaborn for data visualization.
- Participate in online communities and forums to connect with other data enthusiasts and learn from their experiences.
Remember, the key to mastering pandas is continuous practice and exploration. So, keep learning, keep coding, and keep analyzing data!
Fill in the blank: In pandas a dataframe is a _____-dimensional, labeled data structure.
two
In pandas a dataframe is a two-dimensional, labeled data structure. A dataframe is organized into rows and columns.
Now that you have a good understanding of the core structures
and routines of Python, and some of the basics of NumPy, you’re ready to start working with pandas. Pandas is one of the primary
tools that you’ll use throughout the rest of
this certificate program, as well as in a large and growing number of data professions. In this video, you’ll learn about the main classes
in the pandas library and some important ways to work with them. Pandas has two core object
classes: dataframes and series. Let’s begin with a review of dataframes. A dataframe is a two-dimensional, labeled data structure with rows and columns. You can think of a dataframe like a spreadsheet or a SQL table. It can contain many
different kinds of data. Data professionals use dataframes
to structure, manipulate, and analyze data in pandas, just like we did in the previous video with the Titanic example. We can create a dataframe using the pandas DataFrame function. This function has a lot of flexibility, and can convert numerous data formats to a DataFrame object. In this example, we created a
dataframe from a dictionary, where each key of the dictionary
represents a column name, and the values for that key are in a list. Each element in the
list represents a value for a different row at that column. We can also create one from a NumPy array resembling a list of lists, where each sub-list
represents a row of the table. Notice that in this example, we included separate keyword arguments for columns and index. This approach lets us name the columns and rows of the dataframe. These are just a couple
of the many different ways to create a dataframe with
the DataFrame function. For examples of some
others, be sure to review the available pandas
documentation on this topic. Often, data professionals
need to be able to create a dataframe from existing data that’s not written in Python syntax. For example, maybe we want to take an existing spreadsheet and
manipulate it in pandas. Spreadsheets can be saved as CSV files, which can then be read
into pandas as a dataframe. CSV stands for comma-separated values, and it refers to a plaintext file that uses commas to separate distinct values from one another. Here is a sample of the first
few lines of source data from the Titanic dataset
that we used previously. This is what a CSV file looks like. In this file, you’ll find
values for passenger name, age, sex, fare, and more. Notice that a comma is used to separate each value from the next. To create a dataframe from a CSV file, pandas has the “read CSV” function. Here’s the same Titanic data
rendered as a dataframe. For the sake of an example,
it’s defined here as df3. The “read CSV” function
can read files from a URL, like in this example, and
it can also read files directly from your hard drive. Instead of a URL, you’d just provide the file path to your file. Now, let’s discuss the other
main class in pandas: Series. A Series is a one-dimensional,
labeled array. Series objects are most often used to represent individual
columns or rows of a dataframe. So, if we select a row or a column from this Titanic dataframe
and call “type” on it, it will return as a pandas series object. Like dataframes, individual series can be created from various data objects, including from NumPy arrays, dictionaries, and even scalars. Again, refer to the pandas
documentation for examples. Now, let’s use the Titanic dataset to review some of the basics of working with dataframes and series. The DataFrame and Series classes have many super useful
methods and attributes that make common tasks easier. Remember, a method is a function
that belongs to a class. It performs an action on the object. An attribute is a value
associated with a class instance. It typically represents a
characteristic of the instance. Both methods and attributes are
accessed using dot notation, but methods use parentheses,
while attributes do not. Earlier in the video, we
named the Titanic dataset “df3,” but let’s change the name
to “titanic” for clarity. We can do this by simple reassignment. If we want to access the
“columns” of the dataframe, we can use the columns attribute. This returns an index of
all of the column names. We can use the shape attribute to check the number of rows and columns
contained in the dataframe. This dataframe has 891
rows and 12 columns. And we can get some summary
information about the dataframe by calling the info method. This tells us that there
are 891 rows and 12 columns, and it also gives us the column names, the data type contained in each column, the number of non-null
values in each column, and the amount of memory
the dataframe uses. By the way, I want to
address a couple of points about terminology in pandas. First, null values in pandas
are represented by NaN, which stands for “not a number.” And second, if a Series object contains mixed or string data types, when you check its data type, it will come back as an “object.” This is an example of how
pandas is built on NumPy, but the details of this are
beyond the scope of this video. One of the most common
tasks when working in pandas is selecting or referencing
parts of the dataframe. This has many similarities
with indexing and slicing. For example, if you want
to select a single column, you can type the name of the dataframe followed by brackets,
and within the brackets enter the name of the column as a string. This returns a Series
object of that column. You can also use dot
notation, but this only works if the column name does not
contain any whitespaces. Using dot notation is faster to type, so for very simple lines of
code, you may prefer to do this. But if the code begins
to get more complex, it’s generally better
to use bracket notation, because it makes the code easier to read. To select multiple columns
of a dataframe by name, use bracket notation. Within the brackets, enter
a list of column names. This returns a view of your dataframe as a new DataFrame object. If you want to select
rows or columns by index, you’ll need to use iloc. iloc is a way to indicate in pandas that you want to select by
integer-location-based position. If you enter a single integer
into the iloc brackets, you’ll get a series object representing a single row of your
dataframe at that index. Because I entered 0 here, I got the very first row in
my dataframe as a series. If you enter a LIST of a single integer in the iloc brackets, you’ll
get a DataFrame object of a single row of the
dataframe at that index. You can access a range of rows by entering the indices of
the beginning and ending rows separated by a colon. Pandas will return every index starting with the beginning index up to, but not including, the last index. So zero colon three returns row indices 0, 1, and 2. You can select subsets of rows
and columns together, too. This returns a dataframe
view of rows 0, 1, and 2 at columns 3 and 4 only. So, if you want a single
column in its entirety, you select all rows, and then
enter the index of the column you want. And, you can
even get a single value at a particular row in a particular column by using two indices separated by a comma. Loc is similar to iloc,
but instead of selecting by index location, loc is used to select pandas rows and columns by name. Let’s investigate loc with
the Titanic dataframe… In this example, I’m
selecting rows 1, 2, and 3 at just the “Name” column. Note that in this example,
we’re referring to the rows with numbers, even though
we’re using loc to select. This is because our rows
are indexed by number. If we had a named index, however, we’d have to use row names, like what we’re doing for columns. And one more thing. If you want to add a new
column to a dataframe, you can do that with a
simple assignment statement. Now we have a new column at the end here. There are so many things
you can do in pandas, and I can only share so
much in the time we have. As always, the documentation
is your friend. There will inevitably be times where you need to do something that wasn’t explicitly covered here. In those cases, the
documentation almost always has simple examples that demonstrate how to do the thing you need to do. There’s still more to come though, so I’ll meet you soon.
Reading: The fundamentals of pandas
You’ve learned that Python has many open-source libraries and packages—including NumPy and pandas—that make it one of the most useful coding languages. In this reading, you will review the basics of pandas dataframes and learn more about how to work with them. Understanding the fundamentals of pandas is essential to becoming a capable and competent data professional.
Primary data structures
Pandas has two primary data structures: Series and DataFrame.
- Series: A Series is a one-dimensional labeled array that can hold any data type. It’s similar to a column in a spreadsheet or a one-dimensional NumPy array. Each element in a series has an associated label called an index. The index allows for more efficient and intuitive data manipulation by making it easier to reference specific elements of your data.
- DataFrame: A dataframe is a two-dimensional labeled data structure—essentially a table or spreadsheet—where each column and row is represented by a Series.
Create a DataFrame
To use pandas in your notebook, first import it. Similar to NumPy, pandas has its own standard alias, pd, that’s used by data professionals around the world:
import pandas as pd
Once you’ve imported pandas into your working environment, create a dataframe. Here are some of the ways to create a DataFrame object in a Jupyter Notebook.
From a dictionary:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
From a NumPy array (this assumes NumPy has been imported as np):
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2
df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
From a comma-separated values (csv) file:
(Note that this cell will not run, but is provided to illustrate syntax.)
df3 = pd.read_csv('/file_path/file_name.csv')
Attributes and methods
The DataFrame class is powerful and convenient because it comes with a suite of built-in features that simplify common data analysis tasks. These features are known as attributes and methods. An attribute is a value associated with an object or class that is referenced by name using dotted expressions. A method is a function that is defined inside a class body and typically performs an action. A simpler way of thinking about the distinction between attributes and methods is to remember that attributes are characteristics of the object, while methods are actions or operations.
Common DataFrame attributes
Data professionals use attributes and methods constantly. Some of the most-used DataFrame attributes include:
Attribute | Description |
---|---|
columns | Returns the column labels of the dataframe |
dtypes | Returns the data types in the dataframe |
iloc | Accesses a group of rows and columns using integer-based indexing |
loc | Accesses a group of rows and columns by label(s) or a Boolean array |
shape | Returns a tuple representing the dimensionality of the dataframe |
values | Returns a NumPy representation of the dataframe |
Common DataFrame methods
Some of the most-used DataFrame methods include:
Method | Description |
---|---|
apply() | Applies a function over an axis of the dataframe |
copy() | Makes a copy of the dataframe’s indices and data |
describe() | Returns descriptive statistics of the dataframe, including the minimum, maximum, mean, and percentile values of its numeric columns; the row count; and the data types |
drop() | Drops specified labels from rows or columns |
groupby() | Splits the dataframe, applies a function, and combines the results |
head(n=5) | Returns the first n rows of the dataframe (default=5) |
info() | Returns a concise summary of the dataframe |
isna() | Returns a same-sized Boolean dataframe indicating whether each value is null (can also use isnull() as an alias) |
sort_values() | Sorts by the values across a given axis |
value_counts() | Returns a series containing counts of unique rows in the dataframe |
where() | Replaces values in the dataframe where a given condition is false |
These are just a handful of some of the most commonly used attributes and methods—there are many, many more! Some of them can also be used on pandas Series objects. For a more detailed list, refer to the pandas DataFrame documentation, which includes helpful examples of how to use each tool.
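As a quick illustration of a few methods from the table, on a tiny made-up dataframe:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value
df = pd.DataFrame({"fruit": ["apple", "pear", "apple"],
                   "kg": [1.0, np.nan, 2.5]})

print(df.isna())                        # Boolean dataframe of null values
print(df["fruit"].value_counts())       # counts of unique values
print(df["kg"].apply(lambda x: x * 2))  # apply a function over a column
print(df.drop(columns=["kg"]))          # drop a column (returns a copy)
```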
Selection statements
Once your data is read into a dataframe, you’ll want to do things with it by selecting, manipulating, and evaluating the data. In this section, you’ll learn how to select rows, columns, combinations of rows and columns, and basic subsets of data.
Row selection
Rows of a dataframe are selected by their index. The index can be referenced either by name or by numeric position.
loc[]
loc[] lets you select rows by name. Here’s an example:
df = pd.DataFrame({
'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
'B': [1, 2, 3, 4, 5],
'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
'D': [6, 7, 8, 9, 10]
},
index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])
df
A B C D
row_0 alpha 1 coconut 6
row_1 apple 2 curse 7
row_2 arsenic 3 cassava 8
row_3 angel 4 cuckoo 9
row_4 android 5 clarinet 10
The row index of the dataframe contains the names of the rows. Use loc[] to select rows by name:
print(df.loc['row_1'])
A apple
B 2
C curse
D 7
Name: row_1, dtype: object
Inserting just the row index name in selector brackets returns a Series object. Inserting the row index name as a list returns a DataFrame object:
print(df.loc[['row_1']])
A B C D
row_1 apple 2 curse 7
To select multiple rows by name, use a list within selector brackets:
print(df.loc[['row_2', 'row_4']])
A B C D
row_2 arsenic 3 cassava 8
row_4 android 5 clarinet 10
You can even specify a range of rows by named index:
print(df.loc['row_0':'row_3'])
A B C D
row_0 alpha 1 coconut 6
row_1 apple 2 curse 7
row_2 arsenic 3 cassava 8
row_3 angel 4 cuckoo 9
Note: Because you’re using named indices, the returned range includes the specified end index.
iloc[]
iloc[] lets you select rows by numeric position, similar to how you would access elements of a list or an array. Here’s an example.
print(df)
print()
print(df.iloc[1])
A B C D
row_0 alpha 1 coconut 6
row_1 apple 2 curse 7
row_2 arsenic 3 cassava 8
row_3 angel 4 cuckoo 9
row_4 android 5 clarinet 10
A apple
B 2
C curse
D 7
Name: row_1, dtype: object
Inserting just the row index number in selector brackets returns a Series object. Inserting the row index number as a list returns a DataFrame object:
print(df.iloc[[1]])
A B C D
row_1 apple 2 curse 7
To select multiple rows by index number, use a list within selector brackets:
print(df.iloc[[0, 2, 4]])
A B C D
row_0 alpha 1 coconut 6
row_2 arsenic 3 cassava 8
row_4 android 5 clarinet 10
Specify a range of rows by index number:
print(df.iloc[0:3])
A B C D
row_0 alpha 1 coconut 6
row_1 apple 2 curse 7
row_2 arsenic 3 cassava 8
Note that this does not include the row at index three.
Column selection
Bracket notation
Column selection works the same way as row selection, but there are also some shortcuts to make the process easier. For example, to select an individual column, simply put it in selector brackets after the name of the dataframe:
print(df['C'])
row_0 coconut
row_1 curse
row_2 cassava
row_3 cuckoo
row_4 clarinet
Name: C, dtype: object
And to select multiple columns, use a list in selector brackets:
print(df[['A', 'C']])
A C
row_0 alpha coconut
row_1 apple curse
row_2 arsenic cassava
row_3 angel cuckoo
row_4 android clarinet
Dot notation
It’s possible to select columns using dot notation instead of bracket notation. For example:
print(df.A)
row_0 alpha
row_1 apple
row_2 arsenic
row_3 angel
row_4 android
Name: A, dtype: object
Dot notation is often convenient and easier to type. However, it can make your code more difficult to read, especially in longer statements involving method chaining or condition-based selection. For this reason, bracket notation is often preferred.
loc[]
You can also use loc[] notation:
print(df)
print()
print(df.loc[:, ['B', 'D']])
A B C D
row_0 alpha 1 coconut 6
row_1 apple 2 curse 7
row_2 arsenic 3 cassava 8
row_3 angel 4 cuckoo 9
row_4 android 5 clarinet 10
B D
row_0 1 6
row_1 2 7
row_2 3 8
row_3 4 9
row_4 5 10
Note that when using loc[] to select columns, you must specify rows as well. In this example, all rows were selected using just a colon (:).
iloc[]
Similarly, you can use iloc[] notation. Again, when using iloc[], you must specify rows, even if you want to select all rows:
print(df.iloc[:, [1,3]])
B D
row_0 1 6
row_1 2 7
row_2 3 8
row_3 4 9
row_4 5 10
Select rows and columns
Both loc[] and iloc[] can be used to select specific rows and columns together.
loc[]
print(df.loc['row_0':'row_2', ['A','C']])
A C
row_0 alpha coconut
row_1 apple curse
row_2 arsenic cassava
Again, notice that when using loc[] to select a range, the final element in the range is included in the results.
iloc[]
print(df.iloc[[2, 4], 0:3])
A B C
row_2 arsenic 3 cassava
row_4 android 5 clarinet
Note that, when using rows with named indices, you cannot mix numeric and named notation. For example, the following code will throw an error:
print(df.loc[0:3, ['D']])
Error on line 1:
print(df.loc[0:3, ['D']])
To view rows [0:3] at column ‘D’ (if you don’t know the index number of column D), you’d have to use selector brackets after an iloc[] statement:
# This is most convenient for VIEWING:
print(df.iloc[0:3][['D']])
# But this is best practice/more stable for assignment/manipulation:
print(df.loc[df.index[0:3], 'D'])
D
row_0 6
row_1 7
row_2 8
row_0 6
row_1 7
row_2 8
Name: D, dtype: int64
However, in many (perhaps most) cases your rows will not have named indices, but rather numeric indices. In this case, you can mix numeric and named notation. For example, here’s the same dataset, but with numeric indices instead of named indices.
df = pd.DataFrame({
'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
'B': [1, 2, 3, 4, 5],
'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
'D': [6, 7, 8, 9, 10]
},
)
df
A B C D
0 alpha 1 coconut 6
1 apple 2 curse 7
2 arsenic 3 cassava 8
3 angel 4 cuckoo 9
4 android 5 clarinet 10
Notice that the rows are enumerated now. Now, this code will execute without error:
print(df.loc[0:3, ['D']])
D
0 6
1 7
2 8
3 9
Key takeaways
Pandas dataframes are a convenient way to work with tabular data. Each row and each column can be represented by a pandas Series, which is similar to a one-dimensional array. Both dataframes and series have a large collection of methods and attributes to perform common tasks and retrieve information. Pandas also has its own special notation to select data. As you work more with pandas, you’ll become more comfortable with this notation and its many applications in data science.
Resources for more information
Video: Boolean masking
The video discusses various methods of data selection in a pandas dataframe, focusing on filtering based on value-based conditions using Boolean masking. Boolean masking involves overlaying a Boolean grid onto a dataframe, selecting values aligned with True values. An example with a “planets” dataframe illustrates creating a boolean mask for planets with fewer than 20 moons. The tutorial demonstrates applying the mask and introduces logical operators for combining conditions (& for and, | for or, ~ for not). The importance of using parentheses around each condition is emphasized. The video concludes by highlighting the flexibility of pandas for data selection and encourages practice to master these techniques.
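The planets example described above can be sketched with hypothetical data (the planet rows and moon counts here are illustrative, not taken from the video):

```python
import pandas as pd

# Hypothetical data mirroring the "planets" example from the video
planets = pd.DataFrame({"planet": ["Earth", "Mars", "Jupiter", "Saturn"],
                        "moons": [1, 2, 95, 146]})

mask = planets["moons"] < 20   # Boolean Series: True where the condition holds
print(planets[mask])           # keeps only the rows aligned with True

# Combining conditions: wrap each comparison in parentheses
few_or_saturn = (planets["moons"] < 20) | (planets["planet"] == "Saturn")
print(planets[few_or_saturn])
```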
This tutorial provides a comprehensive overview of Boolean masking, a powerful technique for filtering data based on specific conditions in Python. It delves into the usage of Boolean masking with various data structures, including lists, NumPy arrays, and pandas DataFrames, offering practical examples and explanations to enhance your understanding.
1. Introduction to Boolean Masking:
Boolean masking leverages the concept of Boolean arrays in Python. These one-dimensional arrays contain True and False values corresponding to elements in the data structure being filtered. By manipulating these values, you can selectively choose the desired data points.
2. Filtering Lists with Boolean Masking:
Let’s consider a list of numbers:
numbers = [1, 4, 5, 2, 7, 3]
To filter and keep only the numbers greater than 4:
# Create a mask with True for numbers > 4
mask = [number > 4 for number in numbers]
# Filter the list using the mask
filtered_numbers = [number for number, is_greater in zip(numbers, mask) if is_greater]
print(filtered_numbers) # Output: [5, 7]
This code creates a mask with True for numbers greater than 4 and uses it to filter the original list, resulting in a new list containing only those elements.
3. Filtering NumPy Arrays with Boolean Masking:
Similarly, boolean masking can be applied to NumPy arrays:
Python
import numpy as np
data = np.array([1, 4, 5, 2, 7, 3])
# Create a mask with True for even numbers
mask = data % 2 == 0
# Filter the array using the mask
filtered_data = data[mask]
print(filtered_data) # Output: [4 2]
This code creates a mask for even numbers and uses it to filter the array, resulting in a new array containing only even elements.
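NumPy masks built from several conditions use the element-wise operators & , |, and ~, with each condition in its own parentheses, just as pandas does. A short sketch on the same array:

```python
import numpy as np

data = np.array([1, 4, 5, 2, 7, 3])

# Keep values that are even OR greater than 4;
# each condition sits in its own parentheses
mask = (data % 2 == 0) | (data > 4)
print(data[mask])  # Output: [4 5 2 7]
```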
4. Filtering Pandas DataFrames with Boolean Masking:
Boolean masking offers significant benefits with pandas DataFrames, enabling sophisticated data filtering based on specific column values.
Python
import pandas as pd
data = pd.DataFrame({
"name": ["Mary", "John", "Peter", "Alice", "David"],
"age": [25, 30, 28, 22, 32],
"city": ["New York", "London", "Paris", "Berlin", "Tokyo"]
})
# Create a mask for people older than 25
mask = data["age"] > 25
# Filter the DataFrame using the mask
filtered_data = data[mask]
print(filtered_data)
This code filters the DataFrame for people older than 25, demonstrating the filtering capabilities of Boolean masking with DataFrames.
5. Complex Boolean Expressions:
You can combine pandas’ element-wise logical operators & (and), | (or), and ~ (not), with each condition in its own parentheses, to construct complex filtering criteria:
Python
# Filter people in New York or Paris who are younger than 28
mask = ((data["city"] == "New York") | (data["city"] == "Paris")) & (data["age"] < 28)
filtered_data = data[mask]
print(filtered_data)
This code illustrates how to use multiple conditions and logical operators to achieve more specific filtering within a DataFrame.
6. Benefits of Boolean Masking:
Boolean masking offers several advantages:
- Efficient: Filtering large datasets is highly efficient with Boolean masking.
- Flexible: It adapts to various data structures and allows for complex filtering conditions.
- Readable: Code using Boolean masking is often clear and easy to understand.
7. Conclusion:
Boolean masking is a powerful and versatile technique for filtering data in Python. This tutorial provides a solid foundation for implementing Boolean masking with different data structures. Remember to practice and experiment to master this valuable tool for effective data analysis.
Previously, we investigated
several different ways of selecting data in a dataframe, including column selection, row selection, and selection of combinations
of rows and columns using name-based and
integer-based indexing. In this video, you’ll learn how to filter the data in the dataframe based on value-based conditions. You know that boolean is used to describe any binary variable with two possible values: true or false. With pandas, you can
use a powerful technique known as Boolean masking. Boolean masking is a filtering technique that overlays a Boolean
grid onto a dataframe in order to select only
the values in the dataframe that align with the
True values of the grid. Data professionals use
boolean masking all the time, so it’s important that you
understand how it works. Here’s an example. Suppose you have a dataframe
of planets, their radii, and the number of moons each planet has. Now, suppose that you
want to keep the rows of any planets that
have fewer than 20 moons and filter out the rest. A boolean mask is a pandas Series object indicating whether this
condition is true or false for each value in the “moons” column. The data contained in
this series is type bool. Boolean masking effectively
overlays this boolean series onto the dataframe’s index. The result is that any
rows in the dataframe that are indicated as
True in the Boolean mask remain in the dataframe, and any rows that are indicated
as False get filtered out. Here’s how to perform
this operation in pandas. We’ll begin by creating a dataframe from a predefined dictionary using the pandas DataFrame function. The dataframe is called “planets.” The next step is to
create the boolean mask by writing a logical statement. Remember, the objective is to keep planets that have fewer than 20 moons
and to filter out the rest. So, we define the mask by writing “planets at the moons
column is less than 20”. This results in a Series object that consists of the row indices where each index contains
a True or False value depending on whether that row
satisfies the given condition. This is the boolean mask. To apply this mask to the dataframe, insert it into selector brackets and apply it to the dataframe. It’s also possible to
apply the conditional logic directly to the dataframe,
skipping the part where we assign it to a
variable named “mask.” But breaking out the steps individually can make the code easier to follow. Note that we haven’t permanently
modified the dataframe. Applying the boolean mask
using the conditional logic only gives a filtered “view” of it. So, when you call the
planets variable again, it returns the full dataframe. However, we can assign the
result to a named variable. This may be useful if you’ll need to reference the list of planets with moons under 20 again later. Sometimes you’ll need to filter data based on multiple conditions. Pandas uses logical operators to indicate which data to keep and which to filter out in statements that use
multiple conditions. These operators are:
The ampersand for “and,” the vertical bar for “or,”
and the tilde for “not”. Let’s review how this works. Here’s how to create a boolean mask that selects all planets that have fewer than 10 moons
or greater than 50 moons. Notice that each condition
is self-contained in a set of parentheses, and the two conditions are
separated by a vertical bar, which is the logical operator
that represents “or”. It’s very important that each component of the logical statement
be in parentheses. Otherwise, your statement
will throw an error, or worse, return something that
isn’t what you intended. To apply the mask, call the
dataframe and put the statement or the variable it’s assigned
to in selector brackets. Here’s an example of how
to select all planets that have more than 20 moons, but not planets with 80 moons and not planets with a radius
less than 50,000 kilometers. Let’s break it into pieces. First is a statement for planets. In parentheses, we put
“planets at the moons column must be greater than 20”
and close the parenthesis. Then we use the “and” “not”
operators before parentheses “planets at the moons column
equals 80,” close parentheses. Then we again use the
“and” “not” operators before parentheses “planets at the radius
column less than 50,000,” close parentheses. When we apply the mask to the dataframe, we’re left with just one planet: Saturn, with 83 moons and a
radius of 58,232 kilometers. There are a near infinite number of ways to select and filter data
using the basic tools that you’ve learned so far. As always, it takes a lot of practice before you know exactly how to execute every selection statement
that your work requires. So make sure to bookmark
any and all resources that you find helpful to reference. Keep up the good work and I’ll
meet you in the next video.
Reading: Boolean masking in pandas
Now that you know how to select data in pandas by referring to rows and columns, the next step is to learn how to use Boolean masks. Data professionals use Boolean masks to select data in pandas based on conditions. In this reading, you will discover Boolean masking and how to use pandas’ logical operators to form multi-conditional selection statements. Understanding the fundamentals of pandas will help make your work as a data professional easier and more efficient.
Boolean masks
You know that Boolean is used to describe any binary variable whose possible values are true or false. With pandas, Boolean masking, also called Boolean indexing, is used to overlay a Boolean grid onto a dataframe’s index in order to select only the values in the dataframe that align with the True values of the grid.
Return to the example from the video. Suppose you have a dataframe of planets, their radii, and their number of moons:
planet | radius_km | moons |
---|---|---|
Mercury | 2,440 | 0 |
Venus | 6,052 | 0 |
Earth | 6,371 | 1 |
Mars | 3,390 | 2 |
Jupiter | 69,911 | 80 |
Saturn | 58,232 | 83 |
Uranus | 25,362 | 27 |
Neptune | 24,622 | 14 |
Now suppose that you want to keep the rows of any planets that have fewer than 20 moons and filter out the rest. A Boolean mask is a pandas Series object indicating whether this condition is true or false for each value in the moons column:
Moons < 20? | |
---|---|
0 | True |
1 | True |
2 | True |
3 | True |
4 | False |
5 | False |
6 | False |
7 | True |
The dtype contained in this series is bool. Boolean masking effectively overlays this Boolean series onto the dataframe’s index. The result is that any rows in the dataframe that are indicated as False in the Boolean mask get filtered out, and any rows that are indicated as True remain in the dataframe:
planet | radius_km | moons |
---|---|---|
Mercury | 2,440 | 0 |
Venus | 6,052 | 0 |
Earth | 6,371 | 1 |
Mars | 3,390 | 2 |
Neptune | 24,622 | 14 |
Coding Boolean masks in pandas
Here is how to perform this operation in pandas.
Begin with a DataFrame object.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
25362, 24622],
'moons': [0, 0, 1, 2, 80, 83, 27, 14]
}
df = pd.DataFrame(data)
df
moons planet radius_km
0 0 Mercury 2440
1 0 Venus 6052
2 1 Earth 6371
3 2 Mars 3390
4 80 Jupiter 69911
5 83 Saturn 58232
6 27 Uranus 25362
7 14 Neptune 24622
Then, write a logical statement. Remember, the objective is to keep planets that have fewer than 20 moons and filter out the rest.
print(df['moons'] < 20)
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 True
Name: moons, dtype: bool
This results in a Series object of dtype: bool that consists of the row indices, where each index contains a True or False value depending on whether that row satisfies the given condition. This is the Boolean mask. To apply this mask to the dataframe, simply insert this statement into selector brackets and apply it to your dataframe:
print(df[df['moons'] < 20])
moons planet radius_km
0 0 Mercury 2440
1 0 Venus 6052
2 1 Earth 6371
3 2 Mars 3390
7 14 Neptune 24622
You can also assign the Boolean mask to a named variable and then apply that to your dataframe:
mask = df['moons'] < 20
df[mask]
moons planet radius_km
0 0 Mercury 2440
1 0 Venus 6052
2 1 Earth 6371
3 2 Mars 3390
7 14 Neptune 24622
Note that this doesn’t permanently modify your dataframe. It only gives a filtered view of it.
df
moons planet radius_km
0 0 Mercury 2440
1 0 Venus 6052
2 1 Earth 6371
3 2 Mars 3390
4 80 Jupiter 69911
5 83 Saturn 58232
6 27 Uranus 25362
7 14 Neptune 24622
However, you can assign the result to a named variable:
mask = df['moons'] < 20
df2 = df[mask]
df2
moons planet radius_km
0 0 Mercury 2440
1 0 Venus 6052
2 1 Earth 6371
3 2 Mars 3390
7 14 Neptune 24622
And if you want to select just the planet column as a Series object, you can use regular selection tools like loc[]:
mask = df['moons'] < 20
df.loc[mask, 'planet']
0 Mercury
1 Venus
2 Earth
3 Mars
7 Neptune
Name: planet, dtype: object
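loc[] also accepts a list of column names alongside the mask, which returns the filtered rows restricted to just those columns. A minimal sketch, rebuilding the same planets dataframe so the snippet is self-contained:

```python
import pandas as pd

data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                      25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14]}
df = pd.DataFrame(data)

mask = df['moons'] < 20
# Rows where the mask is True, restricted to the planet and moons columns
print(df.loc[mask, ['planet', 'moons']])
```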
Complex logical statements
In statements that use multiple conditions, pandas uses logical operators to indicate which data to keep and which to filter out. These operators are:
Operator | Logic |
---|---|
& | and |
| | or |
~ | not |
Important: Each component of a multi-conditional logical statement must be in parentheses. Otherwise, the statement will throw an error or, worse, return something that isn’t what you intended.
For example, here is how to create a Boolean mask that selects all planets that have fewer than 10 moons or greater than 50 moons:
mask = (df['moons'] < 10) | (df['moons'] > 50)
mask
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
Name: moons, dtype: bool
Notice that each condition is self-contained in a set of parentheses, and the two conditions are separated by the logical operator | (or). To apply the mask, call the dataframe and put the statement or the variable it’s assigned to in selector brackets:
mask = (df['moons'] < 10) | (df['moons'] > 50)
df[mask]
moons planet radius_km
0 0 Mercury 2440
1 0 Venus 6052
2 1 Earth 6371
3 2 Mars 3390
4 80 Jupiter 69911
5 83 Saturn 58232
Here’s an example of how to select all planets that have more than 20 moons, but not planets with 80 moons and not planets with a radius less than 50,000 km:
mask = (df['moons'] > 20) & ~(df['moons'] == 80) & ~(df['radius_km'] < 50000)
df[mask]
moons planet radius_km
5 83 Saturn 58232
Note that this returns the same result as the following:
mask = (df['moons'] > 20) & (df['moons'] != 80) & (df['radius_km'] >= 50000)
df[mask]
moons planet radius_km
5 83 Saturn 58232
Working with pandas dataframes, using their attributes and methods, and selecting data using Boolean masks are some of the core daily activities of a data professional. You’ll soon be using these tools often as you progress on your journey with pandas.
Key takeaways
A Boolean mask is a method of applying a filter to a dataframe. The mask overlays a Boolean grid over your dataframe in order to select only the values in the dataframe that align with the True values of the grid. To create Boolean comparisons, pandas has its own logical operators. These operators are:
- & (and)
- | (or)
- ~ (not)
Each criterion of a multi-conditional selection statement must be enclosed in its own set of parentheses. With practice, making complex selection statements in pandas is possible and efficient.
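To see why the parentheses matter, note that Python’s comparison operators bind more loosely than |, so an unparenthesized statement is parsed as a chained comparison, which tries to reduce a whole Series to a single True/False value and raises an error. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'moons': [0, 1, 80, 83, 27, 14]})

try:
    # Without parentheses this parses as a chained comparison,
    # which calls bool() on a Series and fails
    mask = df['moons'] < 10 | df['moons'] > 50
except ValueError as err:
    print('Error:', err)

# With parentheses, each condition is evaluated element-wise first
mask = (df['moons'] < 10) | (df['moons'] > 50)
print(df[mask])
```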
Video: Grouping and aggregation
This video dives into grouping and aggregating data in pandas using the groupby and agg methods.
Key takeaways:
- groupby groups rows based on values in one or more columns, allowing further analysis.
- Applying methods like sum, mean, or count to groupby objects aggregates data within each group.
- Multiple columns can be used for grouping and aggregation.
- agg allows applying multiple calculations to groups simultaneously.
- Custom functions can be defined and used within agg for specific calculations.
Benefits of grouping and aggregating:
- Gain deeper insights into data patterns and relationships.
- Prepare data for visualization and analysis.
- Summarize large datasets efficiently.
Real-world applications:
- Analyzing financial data by customer segments
- Understanding website traffic by device type
- Exploring scientific data by experiment parameters
Overall:
groupby and agg are powerful tools for data analysis, enabling you to uncover hidden trends and make informed decisions based on your data.
Remember:
- Start with small examples to understand the mechanics.
- Explore and experiment to discover the full potential of these methods.
Pandas, the Python library for data analysis, empowers you to unlock hidden stories within your data. One of its most powerful tools is the ability to group and aggregate data, allowing you to see the big picture and understand trends across different categories. This tutorial will equip you with the knowledge and skills to master this essential skill.
1. Understanding GroupBy:
At its core, groupby groups rows in your DataFrame based on shared values in one or more columns. Imagine sorting your data like books on a shelf, with each shelf representing a group. This allows you to analyze and compare data within each group, revealing patterns and relationships not readily visible in the raw data.
2. Putting GroupBy into Action:
Let’s say you have a DataFrame of customer purchases, with columns like product, price, and customer ID. Here’s how you can use groupby:
Python
# Group by product
grouped_by_product = df.groupby("product")
# Calculate total sales for each product
total_sales = grouped_by_product["price"].sum()
# Print the result
print(total_sales)
This code groups the DataFrame by the “product” column and then calculates the total sales for each product using the sum function.
3. Exploring Aggregation Functions:
groupby isn’t just about sums. Pandas provides a plethora of aggregation functions like mean, median, max, min, and count to analyze your data groups in various ways. You can even use custom functions for specific calculations.
4. Leveling Up with Multiple Columns:
For even deeper analysis, you can group by multiple columns. For example, you can group by both product and customer ID to understand individual customer preferences for different products.
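A minimal sketch of grouping by two columns, using a small hypothetical purchases dataframe (the column names and values here are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['apple', 'apple', 'banana', 'banana', 'apple'],
    'customer_id': [1, 2, 1, 1, 1],
    'price': [1.0, 1.5, 0.5, 0.5, 1.0],
})

# Group by both product and customer, then total the spend per pair
per_customer = df.groupby(['product', 'customer_id'])['price'].sum()
print(per_customer)
```

The result is a Series with a two-level index, one row per (product, customer_id) combination.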
5. Aggregating Multiple Functions:
agg comes in handy when you want to apply multiple functions to each group. Imagine calculating the average price and total number of purchases for each product. You can achieve this with:
Python
grouped_by_product.agg({"price": ["mean", "sum"]})
This code returns a DataFrame with two columns for each product: average price and total sales.
6. Real-World Applications:
Grouping and aggregation are used in various domains:
- Finance: Analyze spending patterns across customer segments.
- Marketing: Understand website traffic by device type and campaign.
- Science: Compare experimental results across different conditions.
7. Beyond the Basics:
This tutorial provides a foundational understanding. As you progress, explore advanced concepts like:
- Hierarchical grouping for nested analysis.
- Chaining operations for efficient data manipulation.
- Applying custom functions for unique insights.
Remember: Practice is key! Experiment with your data and discover the powerful stories hidden within groups. With consistent practice, you’ll become a master of grouping and aggregation, unlocking the full potential of your data analysis in Pandas.
Additional Resources:
- Pandas documentation: https://pandas.pydata.org/docs/
- TutorialsPoint: https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
- DataCamp: https://campus.datacamp.com/courses/customer-analytics-and-ab-testing-in-python/key-performance-indicators-measuring-business-success?ex=9
Start your journey today and unlock the secrets of your data!
Now that you’ve learned how to select and filter data using name- and location-based indexing, as well as boolean masking, it’s time to take the next step. In this video, you’ll learn how to group
your data, aggregate it, and perform calculations
on these groupings to help you discover what
the data is telling you. One of the most important and commonly used tools to group data in pandas is the groupby method. Groupby is a pandas DataFrame method that groups rows of the dataframe together based on their values
at one or more columns, which allows further
analysis of the groups. Let’s return to our planets dataset to demonstrate some different ways to use the groupby method. This time we’ll add a little more data, including the type of planet, whether or not it has a ring system, average temperature in degrees Celsius, and whether or not it has
a global magnetic field. As always, when learning a new data tool, it’s helpful to begin by applying it to a small example. This will better enable you to understand exactly what’s happening. First, let’s examine the mechanics of what happens when you use groupby. When you call the groupby
method on a dataframe, it creates a groupby object. If you do nothing else, the groupby object isn’t very helpful. You’ll basically get a statement saying, “Here’s your object. “It’s stored at this address
in the computer’s memory.” But once you have this object, there are all kinds of
things you can do with it. For example, if we group the dataframe by the “type” column and then apply the sum method to the groupby object, the computer returns a dataframe object with three rows, one for each planet type, and three columns, one for each numerical column. Only the numerical columns are returned because the sum method only works on numerical data. The “type” column is an
index of this dataframe. This information can be interpreted as the sum of all the values in each group at these respective columns. So, for example, radii of all the gas giant planets sum to 128,143 kilometers. That information probably isn’t very useful in most cases, but the total number of moons could definitely be something
we want to calculate. If you want to isolate the information at particular columns, just insert the columns as a list in selector brackets following
the groupby statement. You can also use other
methods instead of sum. For example, min, max,
mean, median, count, and many others. Groupby will work on multiple columns too. When we pass a list containing the type and magnetic field columns to the groupby method and then apply the mean
method to the result, we get a dataframe that contains a row for each unique combination of planet type and magnetic field. Again, the columns contain the mean calculated values for each
numerical column for each group. Groupby is very useful because it helps you to better
understand your data. It’s also useful to organize data that you want to plot on a graph. You’ll learn more about this later. Another important method to use on the groupby objects is the agg method. Agg is short for “aggregate.” This method allows you to apply multiple calculations to groups of data. Let’s start simple. What if we want to group the planets by their type, and then calculate both the mean and the median values of the numeric columns for each group? We call the agg method
after the groupby statement. In its argument field, we enter a list of the calculations we want to apply to the data. If these calculations are existing methods of groupby objects, they can just be entered as strings. We can group by multiple columns and apply multiple aggregation functions to each group. For instance, we can group the planets by type and whether or not they
have a magnetic field, and then use the agg method to calculate the mean and max values of each group. And we can even define our own functions and apply them. For example, suppose we want to calculate the 90th percentile of each group. We can define a function called “percentile 90” that
uses the quantile method on the array and returns the value
at the 90th percentile. Then we can call this custom
function in our aggregation. Notice that we can enter “mean” as a string because
it’s an existing method of groupby objects, but we type the “percentile 90” function as an object because it’s custom-defined. Groupby and aggregate are two tools that together can give deep insight into the story that your
data is telling you. The types of calculations that we just reviewed are daily tasks of data professionals
in nearly every field. Even though we only applied them to a very tiny dataset, these same exact operations would work on a dataset of every
planet in the galaxy, if we knew them all and if we had enough computing power to perform the aggregations! There’s a lot more we can do with groupby and aggregate, and as always I encourage you to explore more on your own, but you should now have
a solid understanding of how and when to apply these tools.
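The custom aggregation described in the video can be sketched as follows; the tiny dataframe here is a stand-in for illustration, not the actual planets dataset from the video:

```python
import pandas as pd

planets = pd.DataFrame({
    'type': ['terrestrial', 'terrestrial', 'gas giant', 'gas giant'],
    'moons': [1, 2, 80, 83],
})

# Custom function: the value at the 90th percentile of a column
def percentile_90(x):
    return x.quantile(0.9)

# "mean" is entered as a string because it's an existing groupby method;
# percentile_90 is passed as an object because it's custom-defined
result = planets.groupby('type')['moons'].agg(['mean', percentile_90])
print(result)
```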
Reading: More on grouping and aggregation
You’ve discovered that pandas is a Python library that facilitates reviewing and manipulating tabular data. In addition, groupby() and agg() are essential DataFrame methods that data professionals use to group, aggregate, summarize, and better understand data. In this reading, you’ll review how these functions work, as well as when and how to apply them.
groupby()
The groupby() function is a method that belongs to the DataFrame class. It works by splitting data into groups based on specified criteria, applying a function to each group independently, then combining the results into a data structure. When applied to a dataframe, the function returns a groupby object. This groupby object serves as the foundation for different data manipulation operations, including:
- Aggregation: Computing summary statistics for each group
- Transformation: Applying functions to each group and returning modified data
- Filtration: Selecting specific groups based on certain conditions
- Iteration: Iterating over groups or values
Here are some examples that use the groupby() function on a dataframe consisting of different articles of clothing:
clothes = pd.DataFrame({'type': ['pants', 'shirt', 'shirt', 'pants', 'shirt', 'pants'],
'color': ['red', 'blue', 'green', 'blue', 'green', 'red'],
'price_usd': [20, 35, 50, 40, 100, 75],
'mass_g': [125, 440, 680, 200, 395, 485]})
clothes
color mass_g price_usd type
0 red 125 20 pants
1 blue 440 35 shirt
2 green 680 50 shirt
3 blue 200 40 pants
4 green 395 100 shirt
5 red 485 75 pants
Grouping the dataframe by type results in a DataFrameGroupBy object:
grouped = clothes.groupby('type')
print(grouped)
print(type(grouped))
<pandas.core.groupby.DataFrameGroupBy object at 0x7f59c2a44ef0>
<class 'pandas.core.groupby.DataFrameGroupBy'>
However, an aggregation function can be applied to the groupby object:
grouped = clothes.groupby('type')
grouped.mean()
mass_g price_usd
type
pants 270.0 45.000000
shirt 505.0 61.666667
In the preceding example, groupby() combined all the items into groups based on their type and returned a DataFrame object containing the mean of each group for each numeric column in the dataframe. Note: In future versions of pandas, it will be necessary to specify a numeric_only parameter when applying certain aggregation functions, like mean, to a groupby object. numeric_only refers to the datatype of each column. In earlier versions of pandas (like the version on this platform) it isn’t necessary to specify numeric_only=True, but in future versions this must be done. Otherwise, it will be necessary to indicate the specific columns to be captured.
In addition, groups may be created based on multiple columns:
clothes.groupby(['type', 'color']).min()
mass_g price_usd
type color
pants blue 200 40
red 125 20
shirt blue 440 35
green 395 50
In the preceding example, groupby() was called directly on the clothes dataframe. The data was grouped first by type, then by color. This resulted in four groups, one for each existing combination of values for type and color. Then, the min() function was applied to the result to reduce each group to its minimum value in each column.
To simply return the number of observations there are in each group, use the size() method. This will result in a Series object with the relevant information:
clothes.groupby(['type', 'color']).size()
type color
pants blue 1
red 2
shirt blue 1
green 2
dtype: int64
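The iteration and filtration operations listed earlier can be sketched briefly with the same clothes dataframe:

```python
import pandas as pd

clothes = pd.DataFrame({'type': ['pants', 'shirt', 'shirt', 'pants', 'shirt', 'pants'],
                        'color': ['red', 'blue', 'green', 'blue', 'green', 'red'],
                        'price_usd': [20, 35, 50, 40, 100, 75],
                        'mass_g': [125, 440, 680, 200, 395, 485]})

# Iteration: each group is a (name, sub-dataframe) pair
for name, group in clothes.groupby('type'):
    print(name, len(group))

# Filtration: keep only the rows of groups whose mean price exceeds 50 USD
expensive = clothes.groupby('type').filter(lambda g: g['price_usd'].mean() > 50)
print(expensive)
```

Here only the shirt rows survive the filter, because the mean shirt price (about 61.67 USD) exceeds 50 while the mean pants price (45 USD) does not.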
Built-in aggregation functions
The previous examples demonstrated the mean(), min(), and size() aggregation functions applied to groupby objects. There are many available built-in aggregation functions. Some of the more commonly used include:
- count(): The number of non-null values in each group
- sum(): The sum of values in each group
- mean(): The mean of values in each group
- median(): The median of values in each group
- min(): The minimum value in each group
- max(): The maximum value in each group
- std(): The standard deviation of values in each group
- var(): The variance of values in each group
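Any of these can be called directly on a groupby object. A quick sketch using a trimmed-down version of the clothes dataframe:

```python
import pandas as pd

clothes = pd.DataFrame({'type': ['pants', 'shirt', 'shirt', 'pants', 'shirt', 'pants'],
                        'price_usd': [20, 35, 50, 40, 100, 75]})

grouped = clothes.groupby('type')['price_usd']
print(grouped.count())   # number of observations per group
print(grouped.median())  # median price per group
print(grouped.max())     # maximum price per group
```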
agg()
The agg() function is useful when you want to apply multiple functions to a dataframe at the same time. agg() is a method that belongs to the DataFrame class. It stands for “aggregate.” Its most important parameters are:
- func: The function to be applied
- axis: The axis over which to apply the function (default=0)
Following are some examples of how agg() can be used. Note that they demonstrate how this function can be used by itself (without groupby()). Note also that, due to platform limitations, some of the following code blocks are not executable. In these cases, output is provided as an image. Here is the original clothes dataframe again as a reminder:
clothes
color mass_g price_usd type
0 red 125 20 pants
1 blue 440 35 shirt
2 green 680 50 shirt
3 blue 200 40 pants
4 green 395 100 shirt
5 red 485 75 pants
The following example applies the sum() and mean() functions to the price and mass_g columns of the clothes dataframe.
clothes[['price_usd', 'mass_g']].agg(['sum', 'mean'])
Output:
       price_usd  mass_g
sum   320.000000  2325.0
mean   53.333333   387.5
Notice the following:
- The two columns are subset from the dataframe before applying the agg() method. If you don’t subset the relevant columns first, agg() will attempt to apply sum() and mean() to all of the columns, which wouldn’t work because some columns contain strings. (Technically, sum() would work, but it would return something useless because it would just combine all the strings into one long string.)
- The sum() and mean() functions are entered as strings in a list, without their parentheses. This will work for any built-in aggregation function.
In this next example, different functions are applied to different columns.
clothes.agg({'price_usd': 'sum',
'mass_g': ['mean', 'median']
})
Output:
        price_usd  mass_g
sum         320.0     NaN
mean          NaN   387.5
median        NaN   417.5
Notice the following:
- Columns are not subset from the dataframe before applying the agg() function. This is unnecessary because the columns are specified within the agg() function itself.
- The argument to the agg() function is a dictionary whose keys are columns and whose values are the functions to be applied to those columns. If multiple functions are applied to a column, they are entered as a list. Again, each built-in function is entered as a string without parentheses.
- The resulting dataframe contains NaN values where a given function was not designated to be used.
The following example applies the sum() and mean() functions across axis 1. In other words, instead of applying the functions down each column, they’re applied over each row.
clothes[['price_usd', 'mass_g']].agg(['sum', 'mean'], axis=1)
     sum   mean
0  145.0   72.5
1  475.0  237.5
2  730.0  365.0
3  240.0  120.0
4  495.0  247.5
5  560.0  280.0
groupby() with agg()
The groupby() and agg() functions are often used together. In such cases, first apply the groupby() function to a dataframe, then apply the agg() function to the result of the groupby. For reference, here is the clothes dataframe once again.
clothes
color mass_g price_usd type
0 red 125 20 pants
1 blue 440 35 shirt
2 green 680 50 shirt
3 blue 200 40 pants
4 green 395 100 shirt
5 red 485 75 pants
In the following example, the items in clothes are grouped by color, then each of those groups has the mean() and max() functions applied to them at the price_usd and mass_g columns.
clothes.groupby('color').agg({'price_usd': ['mean', 'max'],
'mass_g': ['mean', 'max']})
price_usd mass_g
mean max mean max
color
blue 37.5 40 320.0 440
green 75.0 100 537.5 680
red 47.5 75 305.0 485
MultiIndex
You might have noticed that, when functions are applied to a groupby object, the resulting dataframe has tiered indices. This is an example of MultiIndex. MultiIndex is a hierarchical system of dataframe indexing. It enables you to store and manipulate data with any number of dimensions in lower dimensional data structures such as series and dataframes. This facilitates complex data manipulation.
This course will not require any deep knowledge of hierarchical indexing, but it’s helpful to be familiar with it. Consider the following example:
grouped = clothes.groupby(['color', 'type']).agg(['mean', 'min'])
grouped
mass_g price_usd
mean min mean min
color type
blue pants 200.0 200 40.0 40
shirt 440.0 440 35.0 35
green shirt 537.5 395 75.0 50
red pants 305.0 125 47.5 20
Notice that color and type are positioned lower than the column names in the output. This indicates that color and type are no longer columns, but named row indices. Similarly, notice that price_usd and mass_g are positioned above mean and min in the output of column names, indicating a hierarchical column index.
If you inspect the row index, you’ll get a MultiIndex object containing information about the row indices:
grouped.index
MultiIndex(levels=[['blue', 'green', 'red'], ['pants', 'shirt']],
labels=[[0, 0, 1, 2], [0, 1, 1, 0]],
names=['color', 'type'])
The column index shows a MultiIndex object containing information about the column indices:
grouped.columns
MultiIndex(levels=[['mass_g', 'price_usd'], ['mean', 'min']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
To perform selection on a dataframe with a MultiIndex, use loc[] selection and pass multi-level indices as tuples in parentheses. Here are some examples on grouped, which is a dataframe with a two-level row index and a two-level column index. For reference, here is the grouped dataframe:
grouped
mass_g price_usd
mean min mean min
color type
blue pants 200.0 200 40.0 40
shirt 440.0 440 35.0 35
green shirt 537.5 395 75.0 50
red pants 305.0 125 47.5 20
To select a first-level (top) column:
grouped.loc[:, 'price_usd']
mean min
color type
blue pants 40.0 40
shirt 35.0 35
green shirt 75.0 50
red pants 47.5 20
To select a second-level (bottom) column:
grouped.loc[:, ('price_usd', 'min')]
color type
blue pants 40
shirt 35
green shirt 50
red pants 20
Name: (price_usd, min), dtype: int64
To select a first-level (left-most) row:
grouped.loc['blue', :]
mass_g price_usd
mean min mean min
type
pants 200.0 200 40.0 40
shirt 440.0 440 35.0 35
To select a second-level (right-most) row, pass the full index as a tuple:
grouped.loc[('green', 'shirt'), :]
mass_g mean 537.5
min 395.0
price_usd mean 75.0
min 50.0
Name: (green, shirt), dtype: float64
And you can even select individual values:
grouped.loc[('blue', 'shirt'), ('mass_g', 'mean')]
440.0
If you want to remove the row MultiIndex from a groupby result, include as_index=False as a parameter to your groupby() statement:
clothes.groupby(['color', 'type'], as_index=False).mean()
color type mass_g price_usd
0 blue pants 200.0 40.0
1 blue shirt 440.0 35.0
2 green shirt 537.5 75.0
3 red pants 305.0 47.5
Notice how color and type are no longer row indices, but named columns. The row indices are the standard enumeration beginning from zero.
Again, you will not be expected to do any complex manipulations of hierarchically indexed data in this course, but it’s helpful to have a basic understanding of how MultiIndex works, especially because groupby() manipulations typically result in a MultiIndex dataframe by default.
Key takeaways
groupby() will be an essential function in your work as a data professional, as it enables efficient combining and analysis of data. Similarly, agg() will help you apply multiple functions dynamically across a specified axis of a dataframe. Either on their own or when used together, these tools give data professionals deep access to data and help bring about successful projects.
Video: Merging and joining data
Concatenation (concat):
- Combines dataframes horizontally (new columns) or vertically (new rows).
- Useful for adding additional data that shares the same format as the existing dataframe.
- Uses “axis” keyword to specify vertical (0) or horizontal (1) concatenation.
- Example: Combining dataframes with planet information (radius, moons) vertically.
Merge:
- Joins two dataframes based on a shared key (e.g., planet name).
- Different types of joins:
- Inner: Only keeps rows with keys present in both dataframes.
- Outer: Includes all keys from both dataframes, filling missing values with NaN.
- Left: Includes all keys from the left dataframe and matching keys from the right.
- Right: Includes all keys from the right dataframe and matching keys from the left.
- Example: Combining dataframes with planet information and additional details (type, rings, temperature) using various join types.
Key Takeaways:
- Choose concat for adding data vertically with the same format.
- Use merge for joining data based on shared keys, specifying the desired join type.
- Understanding these tools is crucial for data analysis and manipulation in pandas.
Pandas, the data analysis workhorse, excels at wrangling and manipulating data. One of its most powerful features is the ability to merge and join dataframes, allowing you to combine information from multiple sources into a single, unified dataset. This tutorial will equip you with the essential knowledge to conquer these tasks with confidence.
Understanding the Basics:
Merging and joining are related concepts, but they have subtle differences:
- Merging: Combines dataframes by matching values in one or more shared key columns (e.g., customer ID).
- Joining: Combines dataframes by aligning rows on their index.
Both operations result in a single dataframe containing information from the combined sources.
Choosing the Right Tool:
- Merge: Use a merge when you need to match and combine data based on a specific key column. For example, merging customer data with order data based on customer ID.
- Join/concat: Use a join or concatenation when the dataframes already share a meaningful row index and simply need to be lined up side by side, or when identically formatted data needs to be stacked.
Merging with pd.merge:
pd.merge is the workhorse function for merging dataframes. It takes several arguments:
- left: The left dataframe (the one considered “primary”).
- right: The right dataframe (the one to be merged).
- on: The column(s) used as the key for matching data.
- how: The type of join (inner, outer, left, right).
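The arguments above can be sketched on a pair of hypothetical dataframes (the table names and values here are invented for illustration):

```python
import pandas as pd

# Hypothetical customer and order tables sharing a 'customer_id' key
customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Ana', 'Ben', 'Chen']})
orders = pd.DataFrame({'customer_id': [2, 3, 4],
                       'total_usd': [20, 35, 50]})

# An inner merge keeps only the customer_ids present in both dataframes
inner = pd.merge(left=customers, right=orders, on='customer_id', how='inner')
print(inner)
```

Here customer 1 (no orders) and order 4 (no matching customer) are dropped; changing how to 'outer', 'left', or 'right' controls which unmatched rows survive.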
Joining with pd.concat:
pd.concat is typically used for joining dataframes vertically (stacking them) rather than horizontally (merging them). However, you can achieve specific joins using:
- axis=1: Joins dataframes horizontally (useful for related data with different keys).
- join: Similar to how in pd.merge, specifies the type of join (inner, outer).
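A minimal sketch of an index-aligned join with pd.concat, using invented dataframes:

```python
import pandas as pd

# Two hypothetical dataframes with partially overlapping row indices
left = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'b': [10, 20]}, index=['y', 'z'])

# axis=1 lines the dataframes up side by side on their row index;
# join='inner' keeps only the index labels present in both
combined = pd.concat([left, right], axis=1, join='inner')
print(combined)
```

With join='outer' instead, the 'x' row would be kept and its missing 'b' value filled with NaN.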
Exploring Join Types:
- Inner join: Keeps rows only where keys exist in both dataframes.
- Outer join: Keeps all rows from both dataframes, filling missing values with NaN.
- Left join: Keeps all rows from the left dataframe and matching rows from the right.
- Right join: Keeps all rows from the right dataframe and matching rows from the left.
Real-World Examples:
- Merging customer data with purchase history.
- Joining product information with customer reviews.
- Combining financial data from multiple sources.
Tips and Tricks:
- Use descriptive column names to avoid confusion.
- Check for duplicate rows after merging.
- Handle missing values appropriately.
- Explore advanced merge options like suffixes for differentiating columns.
Practice Makes Perfect:
The best way to master these techniques is through practice. Start with small datasets and experiment with different types of joins and merge options. Don’t be afraid to get creative and explore the power of combining data to unlock valuable insights!
Additional Resources:
- Tutorial and code examples: https://realpython.com/pandas-merge-join-and-concat/
- Tips and tricks for data wrangling: https://www.tomasbeuzen.com/python-programming-for-data-science/chapters/chapter9-wrangling-advanced.html
Remember, merging and joining data in Pandas is a powerful tool that can unlock the hidden potential within your data. By understanding the basics, experimenting, and practicing, you’ll be well on your way to becoming a data wrangling champion!
Hello again! You’ve learned a lot about pandas, that it’s a powerful library that makes working with tabular data easier and more efficient, how to select and index
data in a dataframe, how to filter data using Boolean masks, and how to group and aggregate data to derive insights from
the story it’s telling. In this video, you’ll learn how to add new data to existing dataframes. This is a common task
for data professionals, but it’s not as simple as just adding two dataframes together. There are important
considerations to be aware of. By the end of this lesson, you’ll have a good understanding of what these considerations are so you can make informed decisions about how best to add
data to your project. We’re going to learn about
two pandas functions: concat and merge. There’s considerable overlap between the capabilities of these functions, but it’s most important that you learn the basics of each because you will encounter them
regularly as a data professional. We’ll start with the concat function. Recall that to concatenate means to link or join together. The pandas concat function combines data either by adding it horizontally as new columns for existing rows, or vertically as new rows
for existing columns. It’s also capable of handling many data-specific complexities that arise, which allows for a high
degree of user control. In this video, I’ll demonstrate how to
use the concat function to add new rows to existing columns, but remember, there’s plenty
of support documentation if you’d like additional information. Pandas has a specific way to indicate which way we want the
data to be concatenated. We do this by referring to axes. In fact, many pandas and NumPy functions have an “axis” keyword so you can specify whether you want to apply the function
across rows or down columns. The two axes of a dataframe are zero, which runs vertically over rows; and one, which runs
horizontally across columns. We’ll use our basic planets dataset to demonstrate how concat works. This data has four planets, their radii, and their number of moons, but it’s missing the data for Jupiter, Saturn, Uranus, and Neptune. Now suppose we want to add this data, which exists as a separate dataframe. Let’s examine this second dataset with information about
Jupiter, Saturn, Uranus, and Neptune before joining them. Notice that this data
is in the same format as the data in the df1 dataframe. It has the same columns for
planet, radius, and moons. To combine the two dataframes, we’ll want to add df2
as new rows below df1. To concatenate the first dataset with information about
Mercury, Venus, Earth, and Mars with the second, which has information about Jupiter, Saturn, Uranus, and Neptune, we call PD concat and insert a list of the dataframes we want to concatenate. Then we need to include
an axis keyword argument. This instructs the function to combine the data either side-by-side or one on top of the other. We want our resulting dataframe to have eight rows and three columns, which means we want to combine the data vertically. In other words, we want to add new data
by extending axis zero, the vertical axis. Perfect! The data was added as new rows. Notice that each row retains its index number from
its original dataframe. If you want the numbering to restart, just reset the index. We include the “drop equals true” argument because otherwise a new index column will be added to the dataframe, which we don’t want in this case. Now the enumeration of the row indices goes from zero to seven. The concat function is great for when you have dataframes containing identically formatted data that simply needs to be combined vertically. If you want to add data horizontally, consider the merge function. The merge function is a pandas function that joins two dataframes together. It only combines data by extending along axis one horizontally. Let’s return to the planets. Now we have the radius and number of moons for all eight planets, but suppose we want to add the data for the planet type, whether it has rings, its average temperature, whether it has a magnetic field, and whether it has life on it. Perhaps this data exists
as a separate dataframe, but it’s missing Mercury and Venus and it has some recently
discovered planets from other star systems,
Janssen and Tadmor. That’s okay. We can still work with this. First, let’s conceptualize how data joins work. For two datasets to connect, they need to share a
common point of reference. In other words, both datasets must have
some aspect of them that is the same in each one. These are known as keys. Keys are the shared points of reference between different dataframes, what to match on. In our case, the keys are the planets. Each dataframe contains
planets for us to match on. Now let’s consider the different ways that we can join this data. We can join it so only the keys that are in both dataframes get included in the merge. This is known as an inner join. Alternatively, we can join the data so all of the keys from both dataframes get included in the merge. This is known as an outer join. We can also join the
data so all of the keys in the left dataframe are included, even if they aren’t in
the right dataframe. This is called a left join. Finally, we can join
the data so all the keys in the right dataframe are included, even if they aren’t in
the left dataframe. This is called a right join. Let’s examine how each type of join affects our planet data. First we’ll call the function and enter df3 and df4 as the left and right positional
arguments, respectively. Then we include the keyword argument “on,” which lets us specify what our keys to match on should be. In this case, we want to
use the “planet” column. Now we have the “how” keyword argument. This is where we enter
the kind of join we want. Let’s try “inner” first. This merged the data and only kept the planets that appeared in both dataframes. This means we’re missing data for Mercury and Venus from the left dataframe as well as for Janssen and Tadmor from the right dataframe. Now let’s try an outer join. Our function call will remain the same except for the
“how” keyword argument, which we’ll set to “outer.” As expected, this results in a dataframe that contains all the keys from both initial dataframes. Notice that, because Janssen and Tadmor aren’t in the left dataframe, they don’t have information
for radius and moons, so these columns get filled in with NaNs. Similarly, because Mercury and Venus aren’t in the right dataframe, they too are missing some information in the final table, which is represented by NaNs. Next we’ll do a left join. Again, the function gets the same syntax except for the “how” argument, which is set to “left.” This results in a dataframe that retains all the keys from the left dataframe and only the keys from
the right dataframe that exist in the left dataframe too. So Janssen and Tadmor are excluded. Finally, we’ll perform a right join. As expected, the result is a dataframe that has all the keys
from the right dataframe, but none of the keys from the left that weren’t also in the right. So Mercury and Venus are excluded. Nice job! Now that you know the fundamentals, you can use these pandas tools to do the most common kinds of data joins, which will be useful for a wide variety of data projects. And as you advance in your career, you’ll discover even
more about joining data, and how it can get very complex. These tools will be a big help as you do. You’ve come a long way and are now ready to start using pandas to explore your data like
a true data professional. See you soon!
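The planets walkthrough from the video can be sketched roughly as follows. The radius and moon values here are approximate placeholders, and df4 is a trimmed, hypothetical version of the table shown on screen:

```python
import pandas as pd

# Illustrative planet data (values approximate); df1 and df2 share the
# same columns, so they can be stacked vertically with concat
df1 = pd.DataFrame({'planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
                    'radius_km': [2440, 6052, 6371, 3390],
                    'moons': [0, 0, 1, 2]})
df2 = pd.DataFrame({'planet': ['Jupiter', 'Saturn', 'Uranus', 'Neptune'],
                    'radius_km': [69911, 58232, 25362, 24622],
                    'moons': [80, 83, 27, 14]})

# Stack df2 below df1 along axis 0, then renumber the rows 0-7
df3 = pd.concat([df1, df2], axis=0).reset_index(drop=True)

# A second table keyed on planet name, missing some planets and
# including an extra one (Janssen), as in the video
df4 = pd.DataFrame({'planet': ['Earth', 'Mars', 'Janssen'],
                    'type': ['terrestrial', 'terrestrial', 'super-earth']})

# A left join keeps every planet in df3; Janssen is dropped, and
# planets missing from df4 get NaN in the 'type' column
merged = df3.merge(df4, on='planet', how='left')
print(merged)
```

Swapping how='left' for 'inner', 'outer', or 'right' reproduces the other join behaviors demonstrated in the video.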
Lab: Exemplar: Dataframes with pandas
Practice Quiz: Test your knowledge: Dataframes with pandas
Fill in the blank: In pandas, a _____ is a one-dimensional, labeled array.
series
A series is a one-dimensional, labeled array. Series objects are most often used to represent individual columns or rows of a dataframe.
AI
The answer is: series.
A Pandas Series is a one-dimensional, labeled array that can hold data of any type (integers, strings, floats, etc.). Each data point in the Series is associated with a label, making it easy to access and work with specific data points.
The other options are not correct:
- Key: In Pandas, a key is used to access specific data points within a DataFrame or Series.
- DataFrame: A DataFrame is a two-dimensional, labeled data structure that can hold multiple Series with common labels.
- CSV file: A CSV file is a type of file format used to store tabular data, but it isn’t directly related to Pandas data structures.
I hope this helps! Let me know if you have any other questions.
In pandas, what is Boolean masking used for?
Filtering data in a dataframe
In pandas, Boolean masking is used for filtering data in a dataframe. Boolean masking is a filtering technique that overlays a Boolean grid onto a dataframe in order to select only the values in the dataframe that align with the True values of the grid.
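As a minimal sketch, using a trimmed-down version of the clothes data from the reading:

```python
import pandas as pd

clothes = pd.DataFrame({'type': ['pants', 'shirt', 'shirt', 'pants'],
                        'price_usd': [20, 35, 50, 40]})

# The comparison produces a Boolean Series (the mask) ...
mask = clothes['price_usd'] > 30
# ... and indexing with it keeps only the rows where the mask is True
expensive = clothes[mask]
print(expensive)
```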
AI
The correct answer is: Filtering data in a dataframe.
Boolean masking in Pandas uses a Boolean expression to select specific rows or columns in a DataFrame based on whether the expression evaluates to True or False for each data point. This allows you to focus on a subset of your data that meets certain criteria.
Here’s why the other options are not correct:
- Deleting data from a dataframe: While you can use Boolean masking to identify rows or columns for deletion, you’d need additional steps like using the drop() method to actually remove the data.
- Merging data in a dataframe: Merging data typically involves combining two or more DataFrames, and while Boolean masking can be used in some advanced merging techniques, it’s not its primary purpose.
- Adding data to a dataframe: Boolean masking selects existing data, not adding new data to the DataFrame.
Therefore, the main use of Boolean masking in Pandas is to filter data based on specific conditions, providing a powerful tool for exploring and analyzing your data.
I hope this clarifies the role of Boolean masking in Pandas! Feel free to ask if you have any further questions.
What is a pandas method that groups rows of a dataframe together based on their values at one or more columns?
groupby()
groupby() is a pandas method that groups rows of a dataframe together based on their values at one or more columns. This allows further analysis of the groups.
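A minimal sketch, again using a trimmed-down version of the reading's clothes data:

```python
import pandas as pd

clothes = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue'],
                        'price_usd': [20, 35, 50, 40]})

# groupby() collects rows by their value in 'color'; mean() then
# computes one average price per group
avg_price = clothes.groupby('color')['price_usd'].mean()
print(avg_price)
```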
AI
The correct answer is: groupby().
The groupby() method in Pandas is specifically designed to group rows of a DataFrame together based on their values at one or more columns. It creates a pandas.core.groupby.GroupBy object, which allows you to perform various operations on the grouped data, such as calculating statistics, filtering subsets, or transforming groups.
Here’s a breakdown of the other options:
- agg(): This method applies an aggregation function (e.g., mean, sum, count) to each group in the DataFrame and returns the results in a new DataFrame or Series. While it can be used with grouped data, it’s not the method for actually grouping the data by itself.
- keys(): This method returns the labels of the groups in the GroupBy object. It’s useful for accessing the group names after grouping the data.
- values(): This method returns an iterator that iterates over the values (data) in each group. It’s useful for iterating through grouped data but doesn’t perform any grouping itself.
Therefore, groupby() is the best choice for grouping rows of a DataFrame together based on their values at one or more columns. It provides a powerful and flexible way to analyze and manipulate data based on group characteristics.
I hope this clarifies the difference between these methods! Let me know if you have any other questions about Pandas or data analysis.
A data professional wants to join two dataframes together. The dataframes contain identically formatted data that needs to be combined vertically. What pandas function can the data professional use to join the dataframes?
concat()
The data professional can use the concat() function to join the dataframes. concat() is a pandas function that combines data either by adding it horizontally as new columns for existing rows, or vertically as new rows for existing columns.
AI
The best pandas function for joining two DataFrames with identically formatted data vertically is:
concat()
Here’s why:
- concat(): This function is specifically designed for concatenating multiple DataFrames or Series objects together. It can join data vertically (axis=0) or horizontally (axis=1). In this case, since the data needs to be combined vertically, you would use axis=0.
- insert(): This function inserts a new column or row into an existing DataFrame at a specific location. While it can be used to add one DataFrame’s data as a new row to another, it’s not as efficient or straightforward as concat() for vertical concatenation.
- merge(): This function is typically used for joining DataFrames based on a shared key or column values. While you can technically use it for vertical concatenation by merging on a dummy column with identical values in both DataFrames, it’s a more complex approach compared to the simple concat().
- type(): This function simply returns the type of an object. It’s not relevant to joining DataFrames.
Therefore, concat(axis=0) is the most appropriate and efficient function for joining two identically formatted DataFrames vertically in pandas.
I hope this helps! Let me know if you have any other questions about pandas or data manipulation.
Review: Data structures in Python
Video: Wrap-up
Major Learnings:
- Data Structures: Storing, accessing, and organizing data effectively using lists, tuples, dictionaries, sets, and arrays.
- NumPy: Utilizing its computational power for rapid data processing and calculations.
- Pandas: Analyzing tabular data through tasks like filtering, grouping, and merging.
Significance:
- Understanding data structures is crucial for efficient data analysis.
- NumPy and pandas are essential tools for data professionals.
- Pandas will be your companion throughout the program and future career.
Next Steps:
- Prepare for the graded assessment by reviewing new terms, videos, and readings.
- Feel free to revisit any resource to solidify your understanding.
Overall Message:
Congratulations on mastering this section! You’ve built a strong foundation in Python that will empower you to succeed as a data professional. Keep up the fantastic progress!
This is the end of the fourth section of the Python course. You now have a strong
foundation of Python skills that you can continue to build on throughout your future career
as a data professional. In this section of the course, you learned how data
professionals use data structures to store, access, and organize their data. Understanding which data structures fit your specific task is
a key part of data work, and will help you analyze your data with speed and efficiency. We’ve reviewed fundamental data structures that are super useful
for data professionals: lists, tuples, dictionaries,
sets, and arrays. We also discussed two
of the most widely-used and important Python tools
for advanced data analysis. The first was NumPy, which data professionals use
for its computational power. You learned how NumPy can
help you rapidly process large amounts of data and
perform useful calculations. The second Python tool you
learned about was pandas, which is a powerful tool
for analyzing tabular data. You learn how pandas can
help you perform key tasks such as filtering,
grouping, and merging data. Data professionals often
work with tabular data. You’ll use pandas throughout the rest of the certificate program…
and your future career. Coming up, you have a graded assessment. To prepare, review the reading that lists all the new terms you’ve learned. And feel free to revisit videos, readings, and other resources
that cover key concepts. Congratulations on all your progress. Way to go!
Reading: Glossary terms from module 4
Terms and definitions from Course 2, Module 4
agg(): A pandas groupby method that allows the user to apply multiple calculations to groups of data
Aliasing: A process that allows the user to assign an alternate name—or alias—to something
append(): A method that adds an element to the end of a list
Boolean masking: A filtering technique that overlays a Boolean grid onto a dataframe in order to select only the values in the dataframe that align with the True values of the grid
concat(): A pandas function that combines data either by adding it horizontally as new columns for existing rows or vertically as new rows for existing columns
CSV file: A plaintext file that uses commas to separate distinct values from one another; stands for “comma-separated values”
Data structure: A collection of data values or objects that contain different data types
DataFrame: A two-dimensional, labeled data structure with rows and columns
dict(): A function used to create a dictionary
Dictionary: A data structure that consists of a collection of key-value pairs
difference(): A function that finds the elements present in one set but not the other
dtype: A NumPy attribute used to check the data type of the contents of an array
Global variable: A variable that can be accessed from anywhere in a program or script
groupby(): A pandas DataFrame method that groups rows of the dataframe together based on their values at one or more columns, which allows further analysis of the groups
iloc[]: A type of notation in pandas that indicates when the user wants to select by integer-location-based position
Immutability: The concept that a data structure or element’s values can never be altered or updated
Import statement: A statement that uses the import keyword to load an external library, package, module, or function into the computing environment
Inner join: A way of combining data such that only the keys that are in both dataframes get included in the merge
insert(): A function that takes an index as the first parameter and an element as the second parameter, then inserts the element into a list at the given index
intersection(): A function that finds the elements that two sets have in common
items(): A dictionary method to retrieve both the dictionary’s keys and values
Keys: The shared points of reference between different dataframes
keys(): A dictionary method to retrieve only the dictionary’s keys
Left join: A way of combining data such that all of the keys in the left dataframe are included, even if they aren’t in the right dataframe
Library: A reusable collection of code; also referred to as a “package”
List: A data structure that helps store and manipulate an ordered collection of items
List comprehension: Formulaic creation of a new list based on the values in an existing list
loc[]: Notation that is used to select pandas rows and columns by name
matplotlib: A library for creating static, animated, and interactive visualizations in Python
merge(): A pandas function that joins two dataframes together; it only combines data by extending along axis one horizontally
Module: A simple Python file containing a collection of functions and global variables
Mutability: The ability to change the internal state of a data structure
N-dimensional array: The core data object of NumPy; also referred to as “ndarray”
NaN: How null values are represented in pandas; stands for “not a number”
ndim: A NumPy attribute used to check the number of dimensions of an array
Nested loop: A loop inside of another loop
NumPy: An essential library that contains multidimensional array and matrix data structures and functions to manipulate them
Outer join: A way of combining data such that all of the keys from both dataframes get included in the merge
pandas: A powerful library built on top of NumPy that’s used to manipulate and analyze tabular data
pop(): A method that extracts an element from a list by removing it at a given index
remove(): A method that removes an element from a list
reshape(): A NumPy method used to change the shape of an array
Right join: A way of combining data such that all the keys in the right dataframe are included—even if they aren’t in the left dataframe
Seaborn: A visualization library based on matplotlib that provides a simpler interface for working with common plots and graphs
Sequence: A positionally ordered collection of items
Series: A one-dimensional, labeled array where the data type must be the same for all the data in a given series
Set: A data structure in Python that contains only unordered, non-duplicate elements
set(): A function that takes an iterable as an argument and returns a new set object
shape: A NumPy attribute used to check the shape of an array
symmetric_difference(): A function that finds elements from both sets that are mutually not present in the other
Tabular data: Data that is in the form of a table, with rows and columns
Tuple: An immutable sequence that can contain elements of any data type
tuple(): A function that transforms input into tuples
type(): A function used to identify the type of data in a list
union(): A function that finds all the elements from both sets
values(): A dictionary method to retrieve only the dictionary’s values
Vectorization: A process that enables operations to be performed on multiple components of a data object at the same time
Quiz: Module 4 challenge
Fill in the blank: In Python, _____ indicate where a list starts and ends.
square brackets
AI
The correct answer is: square brackets.
In Python, square brackets [] indicate where a list starts and ends. For example:
Python
my_list = ["apple", "banana", "cherry"]
Quotation marks are used for strings, braces for dictionaries, and parentheses for tuples.
A data professional is working with a list named cities that contains data on global cities. What Python code can they use to add the string ‘Tokyo’ to the end of the list?
cities.append(‘Tokyo’)
AI
The correct answer is: cities.append(‘Tokyo’).
In Python, the append() method is used to add an element to the end of a list. Therefore, to add the string “Tokyo” to the end of the list named cities, you would use the following code:
Python
cities.append("Tokyo")
The other options are incorrect:
- cities.pop('Tokyo'): The pop() method removes the element at a specified index from the list and returns it (it expects an index, not a value, so pop('Tokyo') would raise an error). Since we want to add “Tokyo” to the list, not remove it, this is not the right choice.
- cities.import('Tokyo'): Python doesn’t have an import method for modifying lists. The import keyword is used to import modules or functions.
- cities.insert('Tokyo'): The insert() method inserts an element at a specific index within the list. While it could be used to add “Tokyo” to the list, specifying an index isn’t necessary in this case as we want to simply add it to the end.
Which of the following statements accurately describe Python tuples? Select all that apply.
Tuples are sequences. Tuples are immutable. Tuples can be split into separate variables.
AI
The correct statements that accurately describe Python tuples are:
- Tuples are sequences: Like lists, tuples hold an ordered collection of elements.
- Tuples are immutable: Unlike lists, tuples cannot be modified or changed after creation.
The following statement is also correct:
- Tuples can be split into separate variables: You can unpack a tuple’s elements into separate variables using assignment, while the tuple itself stays unchanged.
Therefore, the statement “Tuples cannot be split into separate variables” is incorrect.
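The three properties above (ordered sequence, immutable, unpackable) can be sketched in a few lines; the planet values here are illustrative:

```python
# A tuple holds an ordered, immutable sequence ...
planet = ('Earth', 6371, 1)

# ... and unpacking assigns its elements to separate variables,
# leaving the tuple itself unchanged
name, radius_km, moons = planet
print(name, radius_km, moons)

# Attempting to modify a tuple raises a TypeError
try:
    planet[0] = 'Mars'
except TypeError as err:
    print('tuples are immutable:', err)
```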
Which of the following statements accurately describe Python dictionaries? Select all that apply.
Dictionaries consist of collections of key-value pairs.
AI
The correct statements that accurately describe Python dictionaries are:
- Dictionaries consist of collections of key-value pairs: Each element in a dictionary is associated with a unique key and a corresponding value.
- Dictionaries are instantiated with curly braces: You create a dictionary by enclosing key-value pairs within curly braces ({}).
The following statements are incorrect:
- Dictionaries consist of string-tuple pairs: While keys are often strings, values can be of any Python data type, not just tuples.
- Dictionaries are instantiated with quotation marks: Quotation marks are used for strings, not to create dictionaries.
Note that “Dictionaries are instantiated with the dict() function” is also accurate: the dict() function creates dictionaries just as curly braces with key-value pairs do.
I hope this clarifies the key characteristics of Python dictionaries!
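Both instantiation styles can be checked in a couple of lines (the keys and values here are arbitrary):

```python
# Both forms create the same dictionary of key-value pairs
with_braces = {'Earth': 1, 'Mars': 2}
with_function = dict(Earth=1, Mars=2)

print(with_braces == with_function)
```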
A data professional is working with a dictionary named employees that contains employee data for a healthcare company. What Python code can they use to retrieve only the dictionary’s values?
employees.values()
AI
The correct answer is: employees.values().
Here’s why:
- employees.values(): This method returns a view containing all the values from the dictionary employees.
- items.employees(): This wouldn’t be valid syntax. We want to access the methods of the employees dictionary, not introduce another items variable.
- values.employees(): Similar to the previous point, this wouldn’t make sense grammatically. We’re calling a method on the employees object, not an independent values object.
- employees.items(): This would return the key-value pairs as tuples, not just the values themselves.
Therefore, employees.values() is the most appropriate option for retrieving only the dictionary’s values.
I hope this helps! Let me know if you have any other questions.
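A minimal sketch, assuming a hypothetical employees dictionary, showing what .values() and .items() each return:

```python
# Hypothetical employees dictionary mapping IDs to names.
employees = {101: "Ada", 102: "Grace"}

values = list(employees.values())  # just the values
items = list(employees.items())    # key-value pairs as tuples

print(values)  # ['Ada', 'Grace']
print(items)   # [(101, 'Ada'), (102, 'Grace')]
```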
A data professional is working with two Python sets. What function can they use to find the elements present in one set, but not the other?
difference()
AI
The correct answer is: difference().
Here’s why:
- intersection(): This function would return the elements that are present in both sets. In this case, we want the elements unique to one set.
- symmetric_difference(): This function returns the elements that are present in either set but not in both. While this might seem like the right answer, it returns elements from both sets that don’t overlap, not just from one specific set.
- union(): This function would return all the elements present in either set (sets never contain duplicates). We again want elements unique to one set.
- difference(): This function is exactly what we need. It returns the elements that are present in the first set but not in the second set.
Therefore, difference()
is the best function to use for finding elements present in one set but not the other.
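The four set operations above can be sketched with two small example sets:

```python
a = {1, 2, 3}
b = {3, 4}

print(a.difference(b))            # {1, 2}  - in a but not b
print(a.symmetric_difference(b))  # {1, 2, 4} - in either set, but not both
print(a.intersection(b))          # {3}     - in both sets
print(a.union(b))                 # {1, 2, 3, 4} - in either set
```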
Fill in the blank: In Python, _____ typically contain a collection of functions and global variables.
modules
AI
The correct answer is: modules.
In Python, modules typically contain a collection of functions and global variables. They enable code organization and reusability, allowing you to import specific functionalities into your program instead of writing everything from scratch.
Let’s break down the other options:
- Keywords: These are reserved words with specific meanings in Python, like if, for, and while. They're not containers for functions and variables.
- Comparators: Symbols like ==, !=, and < perform comparisons between values. They're not used for grouping functions and variables.
- Logical operators: Operators like and, or, and not combine boolean values. They're not containers for code either.
Therefore, “modules” is the most appropriate choice for representing collections of functions and global variables in Python.
I hope this clarifies the explanation! Let me know if you have any other questions.
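As a quick illustration, the standard-library math module bundles both functions and global variables that you can import and reuse:

```python
import math

print(math.pi)        # a global variable defined in the module
print(math.sqrt(16))  # a function defined in the module
```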
Fill in the blank: A _____ NumPy array can be created from a list of lists, where each internal list is the same length.
two-dimensional
AI
The correct answer is: two-dimensional.
A NumPy array can be created from a list of lists, where each internal list is the same length, because this structure represents a two-dimensional grid. The outer list represents the rows, and each inner list holds that row's column values.
Here’s why the other options are incorrect:
- Three-dimensional: A three-dimensional array would require a list of lists of lists (three levels of nesting), with consistent lengths at each level.
- Four-dimensional: A four-dimensional array would require four levels of nested lists.
- One-dimensional: A one-dimensional array would simply be a flat list of elements, not a list of lists.
Therefore, when each internal list in a list of lists has the same length, it signifies a two-dimensional structure suitable for creating a two-dimensional NumPy array.
I hope this explanation clarifies the concept! Let me know if you have any other questions.
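A short sketch of building a two-dimensional NumPy array from a list of equal-length inner lists:

```python
import numpy as np

# Two inner lists of length three -> a 2x3 two-dimensional array.
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)   # 2
print(arr.shape)  # (2, 3)
```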
A data professional is working with a pandas dataframe named sales that contains sales data for a retail website. They want to know the price of the least expensive item. What code can they use to calculate the minimum value of the Price column?
sales['Price'].min()
AI
The correct code to calculate the minimum value of the “Price” column is:
Python
sales['Price'].min()
Here’s why:
- sales: This refers to the entire dataframe.
- ['Price']: This accesses the "Price" column specifically.
- .min(): This method, applied to the column, returns the minimum value within that column.
Therefore, sales['Price'].min() directly targets the "Price" column and then extracts its minimum value.
Let's analyze the other options:
- sales = 'Price'.min(): This is not valid; it attempts to call .min() on a string and assign the result to the dataframe's name.
- sales.min().Price: This would first compute the minimum of every column in the dataframe (including potentially non-numeric columns) and then attempt to access a "Price" attribute of that result, which is not the intended operation.
- sales.min().(Price): This is invalid syntax; you cannot call .(Price) on the result of min().
I hope this explanation clarifies the reasoning behind the correct answer! Feel free to ask if you have any further questions.
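A minimal sketch with made-up sales data showing the method in action:

```python
import pandas as pd

# Hypothetical sales data for illustration.
sales = pd.DataFrame({
    "Item": ["mug", "tee", "cap"],
    "Price": [12.50, 18.00, 9.99],
})

lowest = sales["Price"].min()
print(lowest)  # 9.99
```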
A data professional is working with a pandas dataframe. They want to select a subset of rows and columns by index. What method can they use to do so?
iloc[]
AI
The correct method for selecting a subset of rows and columns by index in a pandas dataframe is:
iloc[]
Here's why:
- iloc[]: This method selects rows and columns by integer position (index), such as df.iloc[0:2, 1:3]. It is exactly what's needed for index-based subsetting.
- loc[]: This method selects by label rather than integer position, so it is not the index-based option here.
- concat(): This function joins dataframes together; it does not subset data within a single dataframe.
- merge(): This method combines dataframes based on a common column, not for subsetting data within a single dataframe.
Therefore, iloc[] provides the most direct way to select rows and columns by integer index in a pandas dataframe.
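A short sketch contrasting iloc[] (integer positions) with loc[] (labels), using a hypothetical dataframe:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

# iloc[] selects by integer position: first two rows, last two columns.
subset = df.iloc[0:2, 1:3]

# loc[] addresses the same cells by their labels instead.
same_subset = df.loc[["x", "y"], ["B", "C"]]

print(subset.equals(same_subset))  # True
```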
A data professional wants to merge two pandas dataframes. They want to join the data so all of the keys in the left dataframe are included—even if they are not in the right dataframe. What technique can they use to do so?
Left join
AI
The correct technique for merging two pandas dataframes while retaining all keys from the left dataframe, even if missing in the right dataframe, is:
Left join
Here’s why:
- Right join: This approach keeps all keys from the right dataframe and only includes entries from the left dataframe that have matching keys in the right dataframe. This wouldn’t achieve the desired outcome of preserving all left dataframe keys.
- Inner join: This method only keeps rows with matching keys in both dataframes, which wouldn’t include rows from the left dataframe with unmatched keys in the right dataframe.
- Outer join: A (full) outer join keeps all keys from both dataframes, including unmatched keys from the right dataframe. It would preserve the left dataframe's keys, but it also brings in extra rows from the right dataframe, which goes beyond what's required.
- Left join: This specifically keeps all rows from the left dataframe and fills in missing matches from the right dataframe with null values. This perfectly fits the requirement of retaining all left dataframe keys even if they’re absent in the right dataframe.
Therefore, utilizing a left join ensures that all information from the left dataframe remains while incorporating relevant matches from the right dataframe.
I hope this explanation clarifies the different join options and why a left join is the optimal choice in this scenario. Do let me know if you have any further questions about dataframes or join techniques!
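A minimal sketch of a left join with pandas merge, using hypothetical dataframes; note that the unmatched left key "c" is kept, with NaN filling the missing right-side value:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b"], "right_val": [10, 20]})

# how="left" keeps every key from the left dataframe; "c" has no match
# in the right dataframe, so its right_val becomes NaN.
merged = left.merge(right, on="key", how="left")
print(merged)
```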