Skip to content
Home » University of Michigan » Python for Everybody Specialization » Using Databases with Python » Week 3: Data Models and Relational SQL

Week 3: Data Models and Relational SQL

In this section we learn about how data is stored across multiple tables in a database and how rows are linked (i.e., we establish relationships) in the database.


Lectures


Video: 15.4 – Designing a Data Model

Problem: Designing an Efficient Music Application

  • You want to build an app to sell music tracks, but the initial user interface design has redundant data (e.g., artist name in every row). This could make the application slow.

Solution: Relational Database Design

  1. Identify the Core Entity: The most essential thing your application manages is ‘tracks’, so that becomes your first table.
  2. Separate Redundant Data: Columns with repeated data (artist, album, genre) indicate they should likely become their own tables.
  3. Model Relationships: Decide how things connect based on real-world logic and what your UI needs. In this case:
    • An album has many tracks.
    • An artist has many albums.
    • A track can have one genre.
  4. The Importance of the Data Model: This design you’ve created dictates how the application will work and its efficiency. Getting the model right is critical.

Key Principles

  • Don’t Replicate String Data: Repeating text is inefficient. Instead, use relationships between tables (more on this in upcoming videos).
  • Model the Real World: Your tables should reflect the things your application manages (artists, albums, etc.).

So now we move from sort of the SQL and
the contracts we need to talk to a single table, to linking
multiple tables together, and this is where the real power comes,
and this is where the complexity comes. So up until now, it should be real simple. When you start a company, we’re going to
kinda pretend we’re going to start a small company to build an
application to manage music. When you start a company, you’re going to build an application,
lots of applications need databases. And you need to design the database for
your application. And often this is very much a team
effort with a lot of different people with skills. Some people worry about the user,
end user capabilities, some people worry about the
performance, etc. And so database design,
what I’ll show you coming here is just the beginnings of database
design, but the basics are very powerful. And it’s something where smart people
should take time in database design. And so it’s a very collaborative process and if you ultimately look at
the database design for applications, we draw these crazy pictures, right? And in these crazy
pictures we are capturing, this is basically multiple tables. Table, table, table, table, and this particular database
is kind of about an events, like, you know, reserving the picnic
shelter at a park, right? What times, who’s going to be there,
who reserved it? All these other things. And what we’re starting to show is that
how these tables are connected together. So each of these little arrows,
so this is a table, and these are the columns in the table, and
then some of the columns are special columns that represent
relationships with other tables. And we’re not going to go into all the detail,
you can eventually go into some of the detail where you see what this little
arrow means and that means one and this means many etc., etc.. And this there’s all this fancy stuff and
so we’ll talk about this eventually. But this is sort of like what
we’re going for, we’re going for a picture of how we want our application’s
data to be stored in the database. And this then becomes our schema and
it’s the contract. So, we made one thing with two,
we made one table with two columns, and that was a schema and a contract, but now we’re going to make multiple
tables with many columns and some columns that are specifically
there to connect from one to another. So if you’re working in a large
application, you might walk in and see a picture like this on the wall. And you take a look, whoa, that’s really
important, that must be really complex! And it, for this application,
it probably represents years of clever engineering to make
sure that the application runs well. But this is basically how this,
I don’t know what it is, open mecker medical records system,
I just kind of borrowed the picture. This is how the data is
stored in its database. And these things can be very complex and the project I work on called Sakai,
it’s probably four times bigger than this. But if you zoom in on it, which you can’t,
but it’s just a table and some columns and some connections. And then, oh yeah, there’s a table,
well that one’s kind of complex. Here’s a table, some columns, and
two connections. So, yes, it looks complex on the surface, but ultimately we’re just trying to figure out
which tables we are going to make, what we’re going to put in those tables, and then how
we’re going to connect the tables together. And the connections are the thing
that make these things so powerful. We could just put all these
data in one file, but then this thing would
run like terribly slow. And so the trade-off of thinking
through how your data is going to look is that when you’re done, it’s fast. A lot of times we don’t worry too much
about how fast your program’s going to run. But when it comes to scanning data, especially if it’s a lot of data,
you think about that a lot. So the whole idea is to figure out
the data that you need to represent, and then drawing a picture, and
then lines between those pictures, right? And the basic rule that
we’re going to use is don’t put the same string data in twice. So for example, if we have a column of
something, don’t put Chuck in twice. So that’s bad. So if you have some column and you’re
replicating the same string data twice, that’s not good. What you want to do is you want
to make another table and put a Chuck over here and
give a number to Chuck like 1. And then put 1, 1, so to indicate
that something belongs to Chuck, which means you’re modeling data at a
connection between one table and another. We’ll go through this
in super great detail. So the basic rule is don’t put
the same string data in twice, use a relationship instead. And the other thing is sort
of like model the real world. If you have users and tracks and
Christmas tree farms and whatever, you’ll have a table for the Christmas tree
farms and you’ll have another table for the Christmas tree types and
you’ll have another table for whatever. And so often you’re starting to
sort of build an application and so the application that our little company
is going to build is an application keeping track of audio. Our company has decided that people
don’t use albums anymore and aren’t interested in
buying whole CDs of music. So we’re going to sell music by the track.
I think this is a great idea, and I think we’re all going to get
rich with this little company. And this is the user
interface that I invented, or I screenshotted from another vendor’s. But, whatever this is, we are going to
this is the program we’re going to build. We’re going to make a track-making thing. Now, just looking at this user interface
we see some problems right away. And often, if you just try to
turn this into a spreadsheet, these things would become the problem. And the problem has to do with
replicated data in columns. And so, it’s like great, this is all cool, until you have the same
artist name in the columns. That could be once or hundreds, because
Black Sabbath has written hundreds and hundreds of tracks, and then even in
the album column, that’s a problem. And then it gets even worse when
you’re in the genre column, you’re putting the string
in a zillion times. And if you’ve been writing Python programs
they kind of finish really fast, and that’s because the only assignments
I’ve given you are tiny. But when you’re going to do a million or a trillion things, the difference
between metal and the number 2, 2, 2, 2, 2, to replicate this
is actually very significant. Because this isn’t just a six, or five
character string, this could be a very long string, and so in your data you
have to allow for very long strings. And so these are the problems right there,
all this replication. Now it turns out that we have
a user interface person in the room, and the user interface person we say like,
hi, we’re a database expert. We took a class online and we know that
you’re not supposed to replicate data and the user interface person says, oh no. This replicated data,
we’ve done surveys and that’s exactly what the user wants to see. And you can’t argue with that. If that’s what the user wants to see,
that’s what they see. But we still have to write
an efficient application. And so that’s where we start going
through a data modeling exercise. Where we say, this is the kind of stuff we
want to represent, and this is the kind of user interface we want to support,
how can we build a really good data model? So it’s not like you’ve got to change
the user interface to make it good, you change the data model then to
represent the stuff we want, and then construct it in a way that the user
interface is exactly what the user wants. So the idea is you look at the data
that your application is going to look, and we’ve got all these columns, and you say is this column represent
a thing in the real world? Or is it just another
attribute of a thing? So there’s like a thing and then
attributes of the thing, or two things. And so you’ve got to go across all of the
columns, and so that’s what we’re going to do, we are going to look at all
the columns and we are arguing. So right now imagine we’re in a conference
room, we’re sitting around a table. And we’re going to draw this picture,
and this picture looks like the thing, it’s got little boxes and lines, right? We’re going to draw a boxes and lines picture, and
we have a blank screen right now. So the first question, actually, in these
meetings is commonly where to start. Because you’ve got a lot of stuff.
In our example, it’s really simple. And it turns out that it’s not so
critical where you start. Even if you start at the wrong place,
eventually you’re just going to have this web of information, and it’s all connected,
and it’ll all work out if you do it right. But it does simplify the drawing of your
picture if you start at the right place. And so, the way I was taught, to start
at the right place, was to think about the thing that is the most
essential to this application. What is the one-sentence
description of this application? And in this application,
it’s a thing that manages tracks. It’s not an album-managing thing,
it’s not a artist-managing thing, or a genre-managing thing,
it’s a track-managing thing. And we kind of see that
in our user interface. We see that every line is a track.
And so that makes it easy. The first table we’re going to
build is the track table. And once we build the track table, then we
have to look at all the other things and say, look, which of these
things are themselves tables, and which of these things
are just attributes of track. Okay? And so it turns out that
things like numbers, and like star is just a number, that’s five,
these are all fives, apparently I only like stuff or hate it,
and then this is the number of plays. And so you go like oh, that’s part of it,
that’s part of a track, that’s part of a track. Well, and now we come to the three things
that aren’t part of tracks because they have the replicated data, that’s the red flag
that says uh-uh, not part of a track. So, somebody gets up to the board, grabs
a piece of white board and they go like that’s our first table, it’s a track table and
its going to have a rating, length, and a count. I guess there should be a title in
there as well for this column here, so we have a title field, title. Okay, so now we’re good, right?
We’ve got that, right, we’ve got that. The question is, what’s connected to that? Right? Well,what is the thing that’s
the next thing to draw? Well, tracks could be connected to
artists, they could be connected to albums, so let’s just say,
kind of albums have many tracks. So, the next thing we’re going to do
is we’re going to make an album table. And we’re going to say on our white board,
we’re just drawing on a board right now, we’re not writing code yet. That tracks belong to albums and now it’s
kind of easy that what albums belong to, well albums belong to like groups. Now if you’re a music expert you kind of
understand that we’re oversimplifying things, and that’s a fine argument to
have when you’re starting your company, but we’re going to pretend the world is
simple, and that tracks belong to albums, that belong to artists. So we’re almost done with our meeting. So we got artists, albums, and tracks, and we’ve drawn a picture. And the question is,
where does genre belong to? What does genre connect to? Does genre connect to an album? Does genre connect to an artist? Or does genre connect to a track? Now you might want to go to your iTunes and
go to a track and change its genre. Change it to easy listening. Now if the track was, I mean if
the genre was connected to the artist, I mean this is a album, sorry. It means that it would then immediately
change all those to easy listening because that means that albums have
an attribute of what their genre is. If it was the same for an artist, that means that all your AC/DC
would be changed to easy listening. And if it was an attribute of track, only this one would be
changed to easy listening. So the question is, when you go into
that system and you change one thing, does it change all of these based
on the album or the artist? Well, you can pause and
go ahead and do that but, I’m going to tell you, it turns out if
you change that it only changes one. So that actually tells you,
in the meeting we’re arguing about this. I think it should be here, and
I think it should be here. And then you say well but if you do it this way,
that means that it’s going to change. Then the UI person is like no, no, no, we can’t have it connect to the album
because then our users will be so mad at us and blah blah blah blah blah,
and we wouldn’t make a lot of money. So when it’s all said and
done, you’ve argued through. It took you a lot longer to argue this
last bit, this was the hard part, to say okay, genre belongs to track,
it just does. And this is a simple, trivial, example of how the data model creates
the features of the application. So that when we started
we saw this big picture, that what’s in the data model determines
what the application is capable of doing. And we have to get the data model right,
so that it can all be very efficient. So now that we’ve drawn this picture
on the wall in our conference room, at some point we have to map this into
a database, into a database structure. Okay? We’re going to figure out
how to actually map that logical picture that we just
drew into a physical picture exactly how we’re going to
represent this in a database.

Video: 15.5 – Representing a Data Model in Tables

Key Concepts

  • Logical Model: A conceptual representation of data and their relationships (like the whiteboard drawing).
  • Physical Model: The actual implementation in a specific database system, with details like column names and data types.
  • Primary Key: A unique identifier for each row in a table (e.g., ‘id’).
  • Foreign Key: A column in a table that references the primary key of another table, creating a relationship (e.g., ‘album_id’).
  • Logical Key: A column that might be used to search or sort data (e.g., ‘title’).

The Process of Transformation

  1. Mapping: Take the logical model and add the following to each table:
    • Primary keys for unique row identification.
    • Foreign keys to represent relationships (start points of arrows).
    • Logical keys for potential search/sort columns.
  2. Naming Conventions: Establish clear naming conventions (e.g., table_id) for consistency and readability. Important: Adapt to existing conventions when working in a team.
  3. Creating the Tables:
    • With a Wizard: Use database tools with a graphical interface to create tables and specify their structure.
    • With SQL: Write SQL ‘CREATE TABLE’ statements for more control and speed.

Example: Creating the ‘Track’ Table

The video demonstrates creating the ‘Track’ table, highlighting:

  • Primary key (‘id’)
  • Logical key (‘title’)
  • Foreign keys (‘album_id’ and ‘genre_id’) to represent relationships

Takeaways

  • The translation from logical to physical is mostly mechanical once you understand the concepts and have a naming convention.
  • Consistency with naming conventions is crucial for readability and maintainability.

So welcome back to lunch, right? If you
remember the morning the last lecture we argued, we drank a bunch of coffee,
we scribbled on white boards, we erased white boards, and
we argued about where the genre belonged. But when we were all said and done we ended up with this picture
its still on the white board and we’re coming back from lunch now. This is now how databases represent data.
We have to be a little more explicit. There’s no like belongs to magic in databases but
when you’re building your data model you don’t want to be thinking explicitly too
much about the database. You turn this “logical” model into what we call a physical
model, by mapping it into the database. So, what we do is we just take each of
the tables, and the relationships between the tables, and we have to augment
them with little bits of information. And so, here’s the track table, and
so we’ll make a table called Track. And we take like title, rating,
length, and count, and they become just columns in the table.
That’s the data we have to represent. And then we tend to add things to it. And so we’re going to add what
we call a primary key. And a primary key is a way for us to refer to a particular row, and
so it’s a unique number like 1, 2, 3, 4. So each album is going to
end up with a number. We’ll put insert the title and
the title will get a number. Right? And then we use that number in a column of
a different table to sort of point to it. So these primary keys are the endpoints
that we’re pointing to. So that we can create an arrow
to a particular album, we’ve got to add a id to it. So we add a primary key. And then the other thing we do is we have
to add the columns that are the other end of the arrow, the starting point of the arrow. And so I’m by convention, in this particular
model, I’m naming primary keys all id. In track you’ve got to think all id and
then album_id is sort of the starting point. So album is the name of a table,
followed by underscore, followed by id. Now, there’s no rules about
how you name these things. These could be named X and
Y if you wanted to but, as we’ve learned in programming Python,
and other programming languages, making sensible names really reduces
the cognitive load of understanding what the heck you meant
when you were creating this. So we create a convention,
a naming convention. And, the, we have terminology for
each of these things. A primary key is the key for
which there’s one key for every row. It’s used as the way to point
to that row in other tables. The foreign key is one
of these columns that we add to be the starting point of the arrows. The primary key’s the ending
point of the arrow and the foreign key is the starting
point of the arrow. The only other thing
that we do is we may or may not designate one of the columns,
sometimes more than one, as what’s called the logical key. The logical key is that unique thing that we might use to look
up this row from the outside world. In this case, the album title
might be a way we’d look it up. Like, oh, where is Led Zeppelin IV. Type in IV and it’ll search it. So Led Zeppelin IV is going to be the title. Now all we are doing is saying, hey database, we’re not looking these
folks up by rating or length or count. But we might be looking
tracks up by title. So, somehow on your storage maybe you leave some shortcuts to get to
this more efficiently or more effectively. So the logical key is really our
way of saying we might use this in a WHERE clause. We might use it in a WHERE clause,
help me out out here. We might use it in a WHERE clause. And it also sometimes, it’s like we also
might use it in an ORDER BY clause or a WHERE clause. And so that’s the logical keys. So we have these three things, primary
keys, logical keys, and foreign keys. And so if we keep drawing these
pictures it starts looking complex but then it turns out really simple. If we take the whole data model and
sort of add these primary keys, logical keys, and foreign keys to
model the beginnings and ends, the beginning and end of arrow,
beginning and end of arrows, beginning and ending of arrows, we have to stick
primary keys in all the tables. We stick foreign keys in those
tables that have starting points. And then we indicate that certain
things are our logical keys. It looks kind of complex, but
once you’ve drawn the picture, and you have the naming convention for
what you’re going to name these things, it’s a surprisingly manual task. The mapping from
the logical to the physical is beautifully simple as long
as you have a naming convention. Now, if you walk into an organization and
you took this class and you say like Dr. Chuck’s naming convention is awesome. As a matter of fact, I stole this naming convention
from a thing called Ruby on Rails. The name of the id, the artist_id. It’s a good naming convention. And naming conventions
are kind of like artistic. Some you think are prettier or not
prettier, but any naming convention works. And you walk in and you go like oh,
I took this class and I learned this naming convention of
id and table_id and they’ll say well we don’t use that, we use this thing where
we start with a lowercase i and the we uppercase the iTrackID, that that
would be their primary key for a track, or maybe they call it
pTrack Id for primary key, right? So that would be the name of this thing and
you’re in this company and you’re like whoa. pTrackId, that looks like crap. it’s a
terrible, stupid idea. Well, don’t do that. I mean, this is not a superior
naming convention to this other one, however they do this other one, right? What’s important is that there is a naming
convention, and when you work for a company. just learn their
naming convention, and don’t complain about their naming
convention, unless you get to know them or maybe you’re sort of out for coffee or
something later and you’re like, well, how come you guys never did the
cool Ruby on Rails naming convention? I took a class and it was pretty cool as
long as you’re nice and you don’t threaten them and don’t tell them how dumb they are
by choosing the wrong naming convention. So now what we’re going to do is we’re
going to turn this into a new table. Okay, and so I’m going to use the wizard to
make new table in our little database. Let’s go over to,
we’re still in our database. And I am going to go
to Database Structure. Where’s the new table button? New table, new table. File. New table, create table. There we go. Edit, create table. Okay, so we are going to create a table,
and I need to kind of look at I’m looking for my cheat sheet.
We’ll start with the Artist table. The Artist table, and
we’re going to add a field to it and we are going to name the first field an id
field, and it’s going to be a number. We are going to say that it’s not null,
it’s a primary key. We’re telling it that we’re
going to use this a lot and if we’re linking into this table,
that means we’re using it a lot. Auto increment,
it means we as the programmer don’t actually have to specify this value,
it’s going to be automatically specified. Because pretty much 1, 2, 3, 4, 5 are
the values that we’re going to use. And then unsigned is whether
it’s positive or negative. And for this one I’m going to leave that off. The other thing that’s in this in
our little table, let’s go back and look at our little picture, is the name. Add another field and
we’ll name that name. Oops. Lowercase name. And we will make that a text field. And in this case we’re not
going to give it a length. And, there’s a way later we’ll
give other ones lengths. And then we are going
to hit the OK button. And when it’s all said and
done we have an Artist table, okay? This is our old Users table,
this is our Artist table. We could make a whole new
database if we wanted, but we’re not going to, we’re just
going to put all these things in. Actually, let’s just get rid of this table. Get rid of Users. Edit, Delete Table. Good bye User table. That was from the previous lecture. So now we have an Artist table. Okay? So that’s the wizard way
of creating a table. The next table we’re going to
create is the Genre table. Now the idea is we tend
to work from outward in. The leaves of this tree outward in. So we created the Artist table
then we’ll create the Genre table, then we’ll create the Album table. And then we’ll create the Track table
because you kind of want to create the ends of the arrows before you create
the beginnings of the arrows, okay? So we’ll create the Genre, and now I’m just going to use to SQL because
that’s a lot quicker for me to type. Here’s the CREATE TABLE, we’re going
to say that this guy is an INTEGER, it’s NOT NULL, it’s got a PRIMARY KEY
and it’s AUTOINCREMENT, and it’s UNIQUE. And that is our way of saying give each row a little number. So we go to Execute SQL and
I put this thing in here. And then I run it. It worked and if I take a look at
our Browse Data or our Database Structure, Genre is
in there and Genre has two columns, id and name. And
then we have the other tables. So now we’re going to
create the Album table. Now the thing that’s different
about the Album table is that the Album table is our
first example of a foreign key. So this is our foreign key. We’ve been putting primary
keys in every table. Primary key pretty much say this over and over and over again, this long INTEGER
NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE. Just say that over and
over and over again. Now, we’re going to have
a foreign key which is the start, the foreign key, primary key. Right? And so, we’re just going
to call that an integer. In other databases you can add some
stuff to it, talk about what to do with the parent thing, but
we’re not going to do that for now. We’re just going to say that’s an integer,
that’s a starting point of an arrow. And we’re naming it artist_id
just to jog our memory that that’s what’s going on there. Put SQL, paste it, run it, poof she works. Database Structure. Now we got Genre,
and let’s look at the last one. Okay? So the last one is the track. Okay? So CREATE TABLE,
the name of the table. We’ve got this sort of thing that by now
we’re cut and paste a bunch of times. We have a primary key,
we have a logical key. It just happens to be green, it’s TEXT. And then we have two foreign keys. Because if you remember,
there were two arrows coming out of Track. Both to Album and to Genre. So remember, oh, Album. album_id, genre_id. Naming convention is your friend. And then we just have some stuff
that’s integers like the length, the rating, and the count. Then we put those in. So now we have the columns and the rows
and all the foreign keys for Track. And then we will run that. And if we take a look at the data
we will see we got all of them done.
Album is now complex. And all these things. Wait a sec, did I not? What did I get wrong here? Oh, no, Track is the complex one. So Track has a bunch of stuff. So there we go. So if we take a look at,
we’ve got all of these things, and we have translated our logical
model into a physical model, and we translated the physical model
into a series of CREATE statements. And we’ve run those CREATE statements, and now we’ve represented the various kinds of
tables that we want to insert data into. So we’re going to stop here. But next we’re going to start
inserting data into these tables.

Video: 15.6 – Inserting Relational Data

Key Concepts

  • Logical to Physical Model: The previous exercise created the database structure (tables, foreign keys, etc.). This lesson is about populating it with data.
  • Outward-In Insertion: Always insert data into tables at the “start point” of relationships first (where foreign keys point OUTWARDS). This establishes the IDs needed for later steps.

Example: Inserting into the Music Database

  1. Artist and Genre: Start with the simplest tables, inserting ‘Led Zeppelin’, ‘AC/DC’, ‘Rock’, and ‘Metal’. The database auto-assigns IDs.
  2. Album: Insert album titles, but you MUST explicitly provide the relevant artist_id from the Artist table (e.g., 2 for AC/DC).
  3. Track: Most complex, as it has TWO foreign keys (album_id and genre_id). Again, provide the IDs, not the original text values.

Why Numbers Matter

  • Human vs. Computer: It’s hard for humans to remember which ID is which artist. Code can easily track that, ensuring consistency.
  • Efficiency: Storing ‘Rock’ a million times is wasteful. The ID ‘2’ is smaller, faster to search, and supports better indexing by the database.

So before lunch we developed
a logical model, after lunch we developed
a physical model. And sort of implemented it in the database
and then we went home for the night. And now it’s the next morning and it’s
time to insert some data into our tables. Now, you’ve got to remember all the keys
and fields and stuff like that. Just like we’re inserting data from
kind of like the outward simple things to the more central ones, we’re going to
do the same thing when we insert data. Okay? And so we’re using the INSERT INTO. And this is a good time to point out
that the keywords inside of SQL like insert into, the case doesn’t matter. The names of the tables, the names of
the columns, may or may not matter, and the actual data you put in that’s case
sensitive because that’s your data. So I’m going to type INSERT INTO Artist. And the key thing that is important here
is we’re not specifying the id field. So if you recall, in the Artist table
we have an id field and a name field. But if you look over here,
the id field is an auto-generated field. It’s not really showing up very well
because we don’t have a type. So the id field is auto-generated. So when I write this SQL I don’t need to specify the id because as
part of our contract we made a contract with a database that says,
you assign that. So I’m going to run this now, INSERT INTO
Artist with a name of Led Zeppelin. And if then we Browse Data and we go
look at the artist, it’s added the number 1. Kind of predictable. You can kind of
guess where the next one’s going to be, so the next one thing we’re going to do is
we’re going to insert another artist in, we’re going to insert AC/DC in. And, we inserted that. Now, if we go take a look at the
data again, we got 2. And if we did a new record here, it
would automatically provide this for us, right? I mean, it would let us change this, but
it would want to pick the next number 3. So what we’ve done here is
both inserted the record, but we’ve also established for
within this database a new number. 1 means Led Zepplin, and 2 means AC/DC.
Okay? So then, the next thing we’re
going to do is insert in the Genre, kind of working outward in,
and insert the value Rock and the value Metal, and now I’m
going to show you a little trick here. You can actually, in this user
interface of most of these things, not just this one, you can usually put
more than one command in as long as you put a semicolon at the end of it. So I’m going to, if I did that right,
semicolon at the end of it. So I’m going to do an INSERT INTO
Genre of Rock and an INSERT INTO Genre of Metal and
I’m going to run the SQL. Oh, I wonder if I did it twice. I did. So I’ll just delete this record. Good bye. Delete that record. Good bye. So I did two inserts and
I hit the botton twice, so that’s how I ended up
with two copies of it. So Rock for the rest of this lecture is
the value 1 and Metal is the value 2. And again now when we’re making the Track
table we don’t have to put Rock, Rock, Rock, Rock, Rock, we just put 1, 1, 1, 1, 1. Because 1 is not a string but
it’s a number, that turns out to be okay. So now we’re going to take
a look at the album and this is the first situation where we’ve
got to do a primary key, a foreign key, right? So we got an Album table. And the Album has a primary key which is
going to be automatically generated for us. And a foreign key called artist_id. Because the album points to the artist. And we only add data at the starting
point of these arrows and so that’s what we’re doing. So this one we have to put in title but we
don’t have to put in id so we don’t have to put that in but we do have to explicitly
put in artist_id. So when you insert into Album you’re going to have a title
and then we’re going to have an artist_id. This number we have to remember that
2 was AC/DC and 1 was Led Zeppelin. Now when we’re doing this by
hand it’s a little tricky and you find yourself flipping back and
forth to remember like, oh which one was Led Zeppelin,
which one was AC/DC? And it seems like in the short term it’d
be easier just to put the string in, but that’s the whole thing
we’re trying to avoid. Okay? So for foreign keys we have
to put the numbers in explicitly because we or the code we write
has got to remember them. So remembering what these numbers are,
the 2 and the 1, is a lot easier for a program to remember than it is for
you to remember doing this by hand. Now normally you wouldn’t
insert this stuff, you would have code that inserts it,
and it’s really easy for it to say, oh I just inserted Genre, and just
I inserted Rock, and I got back a 2. So actually you can get that, and so
you the human don’t have to remember this. So let’s run these. This is our first one that’s doing, and I’ll put a semicolon here so
we can do it all at once. And I’ll only push the button once. Okay, and if we do Browse Data and
we look at the Album, the id was auto-generated and
artist_id was something that we chose and title was something that we put in,
so now we’re in pretty good shape. Now the last thing we’re going to
do is probably the most complex. But it’s not. We have and id’s come in for
free, title is data, the length, rating, and count,
that’s just data. And then we have two arrows, right,
pointing to Album and Genre. Album and Genre, and so
we have two foreign keys. Now here is where we have replication now. Those two are going to the same album, and
these two are going to the same album. Same genre, same genre. It’s okay to have replication
as long as they’re numbers. That’s really the essence
of what we’ve done. We just like went in a big circle. Took replicated strings and
turned them into replicated numbers. So these are okay. And so then what you’re going to do, is you’re just going to take each
of these rows that you know. Well, we don’t need that. Clear. Take each of these rows, and
you know most of the data. We’re going to put in the title, we’re
going to put in the album_id, the length, the rating, the count. Put all those things in explicitly, this comes automatically,
and we’re going to do that. title, rating, length, count,
album_id, genre_id. And then we have the values, and then
these are the two foreign key values. And they point to those rows. And again this is the point in time where you
sort of have written down on a piece of paper what these little numbers were
if you were doing it by hand. It won’t be too bad. I luckily have them cut and pasted,
so they are all easy for me. And if you are following along,
then you just cut and paste these guys. And I’ll be super lazy, and
I’ll copy all four SQL statements, and run over here, paste them in, and
then I am going to put a semicolon, carefully, not on each line,
but at the end of the values thing, So I’ll put a semicolon
on each of these guys. Forgot to put a semicolon on the last one. And I’m going to hit the run,
and I’ll look at the data. And I’ll look at the Track data. And there we are. It’s all there. We can kind of move this over. And that’s the Track data and
again we specified all those things. So if you look at all these tables,
we’ve constructed them outwardly in. We started like at the Genre and
the Artist and worked our way in so that we could establish those numbers so
then we could link to those numbers as we put things in
the Album and then we established numbers. And so the foreign keys are the starting
points, the primary keys the ending points and we worked in. And so we have
replication of numbers in this column but we don’t have replication of strings. So the word Rock, if you
look at all this data. Okay that’s a bad example, forget that. The word Metal is better. That doesn’t count, that’s just a title,
that’s not really the word Rock. The word Metal as a genre
only appears once in the entire database. And the thing about it is like what if
there were a million of these, right? One million. You say okay, what’s the difference? The word metal verses whatever. Versus the number 2. If there’s millions of them it’s
not just that it takes up storage, it’s it has to do with how
much has to be scanned. So part of what you’re trying to do in
databases is reduce the amount of data that has to be scanned to get to
a particular piece of information. And so the difference between Metal, which
might actually be 128 characters, and a number, which is actually four characters, that
turns out to be different. And it’s not that we can’t start on disk drives. Disk drives
are big. It’s just that we can’t scan past it fast enough and we can’t built indexes
so we can jump past it even faster, okay? So that’s what we have
achieved in this section, is we’ve inserted all these data and
we’ve linked things together and we’ve modeled relationships and connection
points rather than replicating data.

Video: 15.7 – Reconstructing Data with JOIN

Why Relational Databases Matter

  • Scalability: Breaking data into tables with linked IDs (primary and foreign keys) allows for efficient handling of large datasets.
  • Less Redundancy: Storing data once and linking it means you don’t have to update the same information in multiple places if things change.
  • Flexibility: As your application grows, this structure lets you adapt without breaking everything you’ve already built.

The JOIN Operation

  • Purpose: Reconstructs data from multiple tables, bringing back the seamless view your users expect.
  • How it Works:
    • Combines rows from different tables based on matching values in the primary and foreign keys (the “arrows” in your data model).
    • The ON clause specifies how you want rows to be matched. Without it, you get ALL possible combinations (often unintended!).

Example Explained

To get a track title, artist name, album title, and genre name, a JOIN might look like this:

SQL

SELECT Track.title, Artist.name, Album.title, Genre.name  
FROM Track JOIN Genre JOIN Album JOIN Artist 
ON Track.genre_id = Genre.id 
AND Track.album_id = Album.id 
AND Album.artist_id = Artist.id
  • Key Points:
    • Naming conventions help you create JOIN clauses easily.
    • The picture of your data model is valuable for visualizing these connections.

The Full Circle

  1. User Interface Design: Start with what users need to see, even if it has repeated data.
  2. Data Modeling: Break down data to eliminate redundancy, connect tables with keys.
  3. Data Storage: Store everything efficiently.
  4. JOIN Operation: Reconstructs the user-friendly view on demand.

So, we’re about to come full circle. We just created these
many database tables, we’ve spread things out, we’ve linked them
all back together with these numbers. And it seems like we just
created a mess of everything. But now we’re about to bring it
all back together using the JOIN. We finally have touched
the relational power and it may seem painful, but this is
important, especially when data scales. And if what you’re doing matters,
then data’s going to scale. You can’t say, oh my data is so
small I don’t need to do this. Well, if you’re doing it, and it’s big, and
you’re really doing data analysis, and it’s like something
you’re doing professionally, you tend not to do small
things professionally. You tend to do large things. So this notion of primary keys,
and foreign keys, and the little numbers that point to stuff,
that’s what it’s all about. So we need to reconstruct, for our user interface, the data and
reconnect all these connections. And the SQL syntax that allows this
to happen is the JOIN operation. And the JOIN basically says, we are
selecting data from more than one table. And part of the JOIN operation is the ON
clause that says when we want to connect one row and one table with
a corresponding row in a different table. And so we have this example here of our data, where we just want to sort of pull the
artist title and the artist name out. Except that the artist title
is in the album field, and the artist’s name, because we did this
relationally, is in the artist field, so we don’t have it in one field so
we have to pull across two tables. So that’s where we say SELECT,
and then we want the things. Now this is a syntax here where
you have table name, field name, tablename.fieldname. The title column from the Album table,
the name column from the Artist table, FROM, which is the same
FROM we’ve done before. Here’s a table joined with another table. And what we’re doing is we’re saying, let’s make a super table that
really is the rows of this table, as well as the rows from that table,
and that’s what the JOIN does. It’s like, we’re not going to one table,
we’re going to two tables. Then, in addition to
the JOIN operation saying we’re going to do all these
tables, the ON clause decides when we connect a row in
this table to a row in this table. So row to row, row to row, and
the ON clause that we’re going to write is when the album’s artist_id
that is from the Albums table, the foreign key, the start of our
little arrow, is equal to the id field from Artist,
Artist.id, so it is the destination. So, this is the starting
point of the arrow and this is the ending point of the arrow. So, this is the start. This is the end. So, that’s matching, reconnecting all
the arrows for all the corresponding rows. Select these two fields, one field
from each table, and then we got to glom the two tables together, but
only connect the rows when this is true. For me, I would call this when. Some people prefer to write
this as a longer WHERE clause, some databases prefer that. I tend to like using the ON clause, so
I don’t kind of confuse my WHERE clauses. But if you learn the WHERE clause trick,
it’s the same as the ON clause. So let’s run this one. That’s already in there. Oops. So I’ll run it. So here I go. I’ve got the title. Oops.
I’ve got the title from one table and the name from another table. And we joined them all together. And we had an ON clause that tells
when they’re supposed to be connected. Okay? So we can look a little more deeply
into this relationship that we have, right? We can, you know, clear this. Look a little more deeply into
the relationship that we have here by actually looking at all the data
that’s actually participating in the connection between these two rows. So here we once again have our SELECT, and our FROM clause goes between
Album and Artist. So we’re kind of, this is our super row. From the Album table and
the Artist table. That’s Artist, that’s Album. And we have the foreign
key relationship still. And we’re going to add to this SELECT
these two fields. Everything is the same as before, except we’re going to show you how
this connection is made, okay? So we’re going to select the title,
the artist_id, the artist’s id, and name. So you can see how we’re seeing
the whole table at this table. These things have been joined together
like there’s glue in the middle between them. They’ve been glued together, and they’re glued together in the situation
where these two things match. So if you type that, that’s what you see. And so that’s a way these don’t
really need to be shown, but they are the source of the connection. Okay? So that’s the source of the connection,
and by just selecting them, you can see what it looks like. Okay, so let’s construct another one. Let’s say we want the Tracks title and
the Genre’s name. And if we take a look at the Track,
right, we’ve got genre_id. Now we have some replication here. And then we do need to do this lookup,
right? And we want to see all the tracks. Right?
All the tracks along with their genres. Right? Rock, Rock, Metal, Metal. So now we’ve reconstituted the duplicates.
Right? This is what the user wants to see. But this is what we’re
storing in our database. All right? So again, we say SELECT Track.title. That’s this thing. Genre.name. That’s that thing
from a different table. FROM Track JOIN TO Genre. So that makes, like, this big blob thing that’s a super
row, that smears across two tables. And then we have an ON clause. And you’d think these are hard to construct. But, oh this is the Track table. There is a foreign key name named
genre_id. Oh, and that’s going to be equal to
tablename.id. Right? I almost cut and paste these things so
fast as I write them. And so they get easier and easier and easier,
when you’re going from the Track table to the Genre table, and the syntax that I
use for the connection is very canonical. I do it over and over and over again. Okay? So let me run this guy and there we go. We’ve reconstructed the replication. But this also a good time to
show you a little bit more about what’s going on
underneath this join, right? And now it all seems pretty and
simple and this ON clause. But what happens if we just
take out the ON clause? Select these two things from
these guys glued together. Somehow we’ve glued these
two tables together. So if you don’t have an ON clause
what happens is it basically says all combinations. All combinations. So there’s four rows here and
two rows here. So all combination of each row here. So this turns into 2, this turns into 2,
this turns into 2, so that turns into 8. Now if this had 100, and
this had 100, that’d be 10,000. And so if you don’t have an ON clause
it ends up with all combinations. And we can show this. So this is exactly
the same SELECT statement. Track.title, genre_id, Genre.id, we’re adding the middle joiner guys
that would have been an ON clause. And we’re saying FROM Track,
JOIN with Genre, and no ON clause. Okay, that’s the thing that’s changed. We added these two columns and
we took away the ON clause. So let’s take a look at what
happens when we run this one. And lines don’t matter, so the fact that I put this on multiple
lines you will see it doesn’t care. And so you can add spacing and
lines to make your SQL look prettier. I do that all the time, I indent them. Indenting has no significance,
but it is still very pretty. So now we see that we’re
getting all the combinations. Right? 1, 1; 1, 2; 1, 1; so
Black Dog has got both genres. So it got genre 1 and genre 2. Because we don’t have an ON clause,
so it’s not interested in matching. And so you can think of a JOIN, you can think of a JOIN as building across both tables all possible
combinations between the tables. The ON clause is throwing away
the ones that don’t match. Or a better way to say it is it’s
picking the ones that do match, right? We like this combination. We like this combination. We like this combination. And we like this combination. That’s what the ON clause is doing, is it’s picking the ones
where there’s a match. If you don’t have the ON clause,
then boom. All combinations, 1 to 2. This is one row and
it’s being combined with that. This the the next row and
we want both combinations. This is the third row, all combinations;
fourth row, all combinations, and then the ON clause wipes
out the non-matching ones. Whoops, I did it wrong again. I keep doing it wrong,
but you get the idea. So that’s the ON clause. That’s really how it works, and frankly
this is one of the things when I first looked at, it must be doing that but
it doesn’t do that. When it sees the ON clause it’s
really clever about pulling things in. Super clever. Again, that’s not our problem. So now it can get complex. Because now what we want
is the track title, the artist name, the album title,
and the genre name. And this looks like a lot of stuff,
but if you follow it, it’s a pattern. So we have a SELECT,
these are the things we want. That is the list of the output we want. We’re not interested in all the id’s. And
we have to have this long FROM clause, that says, Track join with Genre
join with Album join with Artist. That’s all the rows,
all the tables concatenated together, and then the ON clause, which seems
kind of tricky, but now we start. We start from Track and work out. We have the foreign key in Track of
genre_id is equal to the genre’s primary key. And the Track’s album_id is equal
to the album’s primary key, and there’s one more foreign key and that is
the Album’s artist_id equals the Artist.id. So go back to that little
picture that we had this that pointed to this, that
pointed to that, that pointed to that. We just look at this, the picture,
and this is why the pictures are so important, and we reconstruct all this. And so this for me,
it may look a little complex. But after you do it a few times, you realize that the naming convention
really, really saves you, okay? Let me show you that,
select this whole monstrous mess. Type her in, and then run it. And we’ve reconstructed it, right? Now, we’ve got replication here,
but that’s the output. That’s the beauty of databases,
you reconstruct any replication, but you don’t actually store the replication. And that’s why you have to
learn to write JOIN clauses. So if we think about this,
we’ve gone full circle. it’s taken us, just like any start-up, it just took us like three
days to build our product. We started with a user interface that
we designed that had replication. Then we came up with a data, a logical
data model, then a physical data model, then we inserted all the data, we connected things with
numbers instead of strings. And now we use a JOIN to reconstruct it. And so, we start here. We go through a step, a step, a step,
a step, and then we come back to here, and we’re able to reconstruct, in effect, the output that our end users
are going to want to see. Except now we can handle millions or
billions of rows, because we have carefully constructed
a data model that makes sense.

Video: Worked Example: Tracks.py (Chapter 15)

Database Design

  • Normalization: The database uses separate tables for artists, albums, and tracks to reduce redundancy and enforce data integrity.
  • Primary Keys: Each table has a unique ID column (e.g., artist ID) to identify individual records.
  • Foreign Keys: Columns like album ID in the tracks table link the data together, forming the relationships.

CSV Import Process

  1. Reading and Parsing: The Python script reads the CSV file, splitting each line into track title, artist, album, etc.
  2. Artist & Album Insertion:
    • INSERT OR IGNORE ensures unique entries in the artist and album tables, preventing duplicates.
    • The script retrieves the primary keys (IDs) of inserted artists and albums.
  3. Track Insertion: The script inserts track data into the track table, using the retrieved foreign keys to link artists and albums.
  4. Committing Changes: The commit statement saves the changes to the database.

Database Interaction

  • SQLite3: The walkthrough uses SQLite as the database and shows how to connect and query from Python and the SQLite browser.

Key Points

  • The database is designed for efficiency and proper relationships between entities.
  • The script demonstrates a step-by-step approach to importing data while maintaining database integrity.

Hello everybody, and welcome to yet
another walkthrough of Python for Everybody. This is a walkthrough of
the tracks CSV application, and you can go to www.py4e.com/code3 and you will find a file named tracks.zip. Tracks.zip, download this and
extract it and you will have all the code
that I am working with. So, here I am in that file, let’s take
a look at the files that are in here. So, there’s an old folder for the old XML
version of this application if you want to try that one, it’s still sitting there. The auto grader doesn’t actually care
whether you use the CSV1 or the XML1. The XML1 is interesting if you
have to learn XML, but basically, what we’re going to do is we’re
going to take a CSV file, tracks CSV compass separated values, and this is just a whole bunch of data
from my [LAUGH] my own export of my iTunes data,
then converted to CSV from XML. So, it’s got the title of it,
the artist, etc. And so, we’re going to read
this one line at a time and then insert it in a database. And the key thing to this database is
we’re going to make it in normal form, meaning that we’re only going to
put the artist in once, we’re going to have a primary key. They’re going to have a unique name, so the word queen is going to only
appear once in that table. And so, we’re going to link everything
together through foreign keys like artist ID and album ID. And if we get this thing started,
we start her up, we go to trackdb.sqLite. And I want to be able to run this over and
over again, so the first thing I do is drop all three
tables and then I recreate them, both in case I want to change my schema or
if I just make a mistake and I want to start over because if you write
this or you’re modifying this code, you’ll make little mistakes and
you just want to start over. Every time you run the code
you want to start a fresh, so let’s take a look at the code
that’s going to run. We’re going to open that tracks.CSV file,
this one right here, and then it’s not a super formatted CSV. This is a really simple CSV that’s just
comm separated versus a true CSV is actually a format and there’s code inside
Python that parses the rich CSV format. But I’m just going to read through and
split it by commas. And in case there’s some weird piece of
code in there and there’s not what I want, if there’s less than six pieces,
I’m just going to skip it. So, I put a continue in here and
then I parse out the pieces, the name of the track, the artist, the album,
the count, the rating and the length. And so,
we’re most interested in an artist and album doing a money to one
relationship into that. So, I just immediately print those things
out because when I build something like this, I would build these lines first. And yes, it looks like my
parsing is working correctly. Then we’re going to do the many to one,
and so we work from the leaves
of the tree inward. So, we’re going to do an insert or
ignore into the artist by name. And so, whatever that is,
queen or whatever. Now that’s the first time it’s
going to actually insert it and assign a primary key. [COUGH] The second time this ignore
is going to work because the name on the artist table is defined as unique and
so you can’t insert that more than once. So, this will work the first time
we see a particular artist, but it will ignore at the rest of that point. But then we got to figure out what
the generated ID is, the primary key. So, we’re going to select that, so
you notice there’s an ID in there, that means auto generate this
if I don’t give it to you. And so, we’re going to grab the ID
from artist where name equals and then look that same artist up by the name
of them and then grab the artist ID. This is going to be fetch one, get one
record, and then the zero with item, which is there’s only one column and
there’s going to only be one row and artist ID is what we get. Same thing is true, except now we’re going to use artist
ID as a foreign key into the album. And so, we’re going to insert
an album by title, and artist ID is a foreign key in there. And then we’re going to do the same trick
where we’re going to look up the ID, the primary key of that particular album. We’re recording the artist ID,
if we’re inserting it, we want to record it, but now we’re
going to get the primary key of the album. We’ll have that and
then let’s get all that figured out, we’re working from outward in. So, we’re doing artist then we’re
doing album and then we’re do track, that’s outward in and then we’re going to
insert the foreign key, album ID, title length,
rating account into the table. And then we commit it so
that we write it all to the database. So, let’s go ahead and
run this code, python tracks.py, it opens the file,
loops through everything and away you go. So, if you look at the last file and
its microchip, Jason Farnham, and then that is
the last record that came out. So, now at this point,
let’s take a look at the database and trackdb.sqLite, that’s the file. And I can say I’ve got the sqLite
browser installed in the command line, you may have it in your, oops, [LAUGH] I can’t even type it,
and then trackdb. Okay, so I can say what tables do I have
in here, and I see they got the album, artist and track. And I can say select star
from artist limit ten. So, you see that Queens are the first one,
Led Zeppelin, these are the primary keys. There is only one record named Queen,
so I could even say this, select our star from
artist name equals Queen. So, you’ll see that there
is only one record, and then if we say select star from
track limit 10, we see they’re all connected together, and
we got the primary key for the track. And then the album ID is
this third parameter here, you see that’s album seven and
there’s a lot of reuse there in the data. So, let’s get out of this, so
I’m going to open the SQLite browser and we’re going to open the trackdb.sqLite. So, we see that those
are the tables that got created, we can take a look at the data. We can go look at, for example,
the artist data, we see again, one same kind of data there,
see all of them now. And then we can work our way into album, the album has a foreign key of artist ID,
and each album has its own ID. And then if we go to track, we see them
all together where we have album ID and this other data which is not
really foreign keys, but we see the album ID foreign key. So, that gives you hopefully a quick
run through of the CSV version of the tracks and
the one-to-many mapping into a database. [MUSIC]

Assessments


Quiz: Multi-Table Relational SQL

What is the primary added value of relational databases over flat files?

What is the purpose of a primary key?

Which of the following is NOT a good rule to follow when developing a database model?

If our user interface (i.e., like iTunes) has repeated strings on one column of the user interface, how should we model this properly in a database?

Which of the following is the label we give a column that the “outside world” uses to look up a particular row?

What is the label we give to a column that is an integer and used to point to a row in a different table?

What SQLite keyword is added to primary keys in a CREATE TABLE statement to indicate that the database is to provide a value for the column when records are inserted?

What is the SQL keyword that reconnects rows that have foreign keys with the corresponding data in the table that the foreign key points to?

What happens when you JOIN two tables together without an ON clause?

When you are doing a SELECT with a JOIN across multiple tables with identical column names, how do you distinguish the column names?

Multi-Table Database – Tracks

Code

Bonus


Video: Bonus: Office Hours Perth, Australia

  • Setting: Chuck Severance is hosting a “Face to Face Office Hours” session in Perth, Australia.
  • Purpose: Connecting Coursera students in person and sharing their experiences.
  • Student Introductions:
    • Ronald: A veteran Coursera user (7 courses) with a background in computer science, taking courses for fun.
    • Diana: From New Zealand, taking her first Coursera course (unfinished doctorate at Oxford mentioned).
    • Eloise: Using Coursera to supplement her education and enjoys the experience.
    • Terry: Enjoying a course led by Chuck and finds Coursera a unique learning format.

Key Takeaway: This video showcases the diverse backgrounds and motivations of Coursera students worldwide, fostering a sense of community within online learning.

Hi, this is Chuck Severance,
and here we are at yet another Face to Face Office Hours. This time we are in Perth,
at the western end of Australia, the world’s most remote capital city. So I want to introduce you to some of
your fellow students here from Perth. We’ll hear your name,
and say hi if you like. >> Hi, my name is Ronald, and
this is my seventh Coursera course. >> Excellent. >> And, yes, I’m wearing this t-shirt
here, I just want to show my friend. >> Okay.

Because it’s got this website and you know he wanted me to
wear it while on Coursera. >> And so we’ve got your t-shirt. Congratulations on your
seven Coursera classes. You’re pretty much using this
like a master’s degree, right? You got a bachelor’s degree and you’re kind of making up
your own master’s degree. >> Yes,
I did computer science back in the 90s. And I’m Solaris and Linux system admin. And I’m just doing this just for fun. And nostalgic,
because I am in the computer business. >> Yeah. Okay.
Nice meeting you anyway. >> Nice meeting you. Nostalgic
Hello. >> Hi.
I’m Diana. I’m from New Zealand and
it’s my first Coursera course. >> You’re also formally from Oxford,
right? >> I didn’t finish my doctorate. >> Oh [LAUGH]. >> Hi, my name’s Eloise, and I’m using Coursera to sort of fill
up some gaps in my education. And I’ve had a lot of fun tonight. >> Okay. >> Hi, I’m Terry. >> I’m enjoying Chuck’s course. Yeah.
It’s good enough. Coursera courses are something
different though. >> Yeah. Okay. So there you have it. Another successful Face to
Face Office Hours from Perth, Australia.

Video: Bonus Interview: Niklaus Wirth

Key Points

  • Back to Basics: Wirth felt programming should be taught with clear, rigorous languages instead of the alternatives available in the 1960s.
  • Birth of Pascal: Despite not winning the IFIP standard, he implemented Pascal out of necessity for teaching, later finding popularity on microcomputers due to its affordability and structured design.
  • Modula-2: System Development Focus: Introduced the ‘module’ concept for safe linking. Emphasized separate interface and implementation for type-checking, a key feature missing in Fortran.
  • Oberon: Simplicity Meets OOP Created with a focus on simplicity and object-oriented programming through type extension and procedure variables. Oberon also inspired a custom operating system.
  • The Hardware Connection: Wirth, originally an electrical engineer, was inspired by Xerox Alto workstations. This led him to build his own computers, Lilith and Ceres, closely aligned with Modula and Oberon, respectively.
  • FPGA Experiments: Explored simplifying processor design on FPGAs, implementing his own RISC-like processor. This allows his book to be updated, demonstrating the enduring principles of his work.

Legacy:

Niklaus Wirth’s contributions significantly shaped programming language design and education, promoting structured programming, modularity, and simplicity as core principles.

[MUSIC] When I returned to Switzerland, I got
an invitation at the University of Zurich and at the ETH to introduce
computer science as a new subject. And, of course, being a self-trained man
essentially, I looked what tools were available and
that was rather a disappointment. Yes, yes, there was ALGOL available, which
convinced me through its rigor and its good structure, but it had a lousy
implementation. And I felt using that ALGOL could only reduce the chances of it ever being
accepted. And on the other side, there was only Fortran, which I found unsuitable
for teaching. Not to talk about Assemble language or
Assemble code. So I decided to continue my work from
Stanford and implement, no, not another ALGOL compiler, but what later became
known as Pascal. I’d had been a member of the IFIP working group and there were
finally two proposals. One by Arthur Langharton from Amsterdam,
and one from me. And I might say I lost out, and then I decided to implement it
just in spite of it all because I needed it, I needed it
for teaching. And that’s how what lead to one and a half
years later, to Pascal 1969. And in 1971, I used it for the first time
in an introductory programming course. >> It kind of took a life of its own, right? It took a life of its own in the, in
corporate, in the corporate world. I mean, >> Yes, yes, although I
must say, it took some time. We had implemented Pascal on
quite a lot of different computers. And we had helped other universities who wanted it, to transport it to their
computers over Pascal P, the portable system. But the real breakthrough came actually
with the advent of the microcomputer. Apple II, particularly, also Tandy and some others, and they brought out UCSD Pascal and Pascal implementation by Borland,
Turbo Pascal. And and they were selling not only
compilers, but an integrated system with
text editor and debugger for something like $50. And that really
made the difference, at a time when compilers would still cost
thousands of dollars for large machines. And now of course they spread into the areas where people did not come
loaded with bad preconceptions, you know. They started
learning programming from scratch. And that’s how computing was brought into
homes and schools. >> What other language would have been
suitable for the mircroprocessor revolution in the early
days of the microprocessor revolution? >> Yeah, well that was of course
the point, yes? There was only one competitor, that
was Basic. But from the pedagogical point of view, I
think Pascal was the right thing. >> But even if you think of it as a systems development language, on
early microprocessors. >> Mm-hm. >> The fact that you had real interfaces. >> Yeah, yeah. >> And type. >> Yeah. >> And the ability to try to kind of at
some level make contracts. >> Types and- >> Basic, and I mean so Basic really
wasn’t a competitor at all. I think, if you start thinking about
doing something… >> Well, for beginner’s courses
it was, of course. >> Yes, >> And in particular in the US people were not used to structured languages,
you know. Even Fortran was a fairly flat thing, and,
but Pascal caught on quite well also in primary schools. >> And, and, and Pascal was the,
sort of, the API definition for Apple all the way
through Macintosh, right? >> That’s right. >> Did you interact with UCSD, with
Apple, or by then was it simply kind of a public, a public good? >> It was public good. Yes. No, I had very little interaction,
really almost none. Yeah. The Atlantic Ocean is too wide, or
was too wide for close interaction, I think, and we had always
distributed our software for free, covering the cost of the tape. And so of course nobody had an obligation
to fall back on me. >> Certainly, in the beginning of
computing, open source wasn’t something we thought of as special or different, it was
just a way of behaving. >> Yes, yes. I think universities always
behaved like that, open source. I mean, universities have an interest in
spreading their ideas, and not in protecting them. >> Now kind of talk about the follow-on the languages that you carefully didn’t
name Pascal. >> Yes, I did not, and from the
commercial point of view it’s regrettable because if I had
called Modula Pascal 2 and Oberon Pascal 3, they
wold have had better success. Yes, Modula-2 came nine years after
Pascal. And it was a language designed for system
development influenced also by Mesa, developed at Xerox
Park where I spent a sabbatical, Mesa itself based on Pascal, and the
primary new feature of Modula was the module.
And we fixed the interface specification, and implementation, to
separate compatibility of modules. Separate compatibility with tight interface type checking, that’s of course
what was missing in Fortran. And so linking different modular modules
together is as safe as just programming in one module,
and that was absolutely crucial thing. >> So talk about the transition then
from Modula to Oberon. >> That took place after my second sabbatical at Xerox Park starting ’87, ’89. I and my colleague Jurg Gutknecht had
actually become convinced that the future lay, particularly for
teaching programming, in simple languages. And Modula, to our taste, had already been overloaded with
features and facilities. And so we wanted to, to, clean up, a modular tool and that
resulted in Oberon. We added essentially only one feature, an important one though, that was type
extension. And together with type extension, and
together with procedure variables, which were already
present in Modula, you could implement the full scheme of
object orientation. So, from the outset that was the innovation in Oberon, simplicity plus
object orientation. >> So Oberon also begat an operating
system, correct? >> Yes. Remember I had spent my sabbatical, the
first one, at Xerox in ’76, ’77. And I was given an Alto computer, for
myself alone, under my desk. And that was, of course, an absolute change
in, in the way computers were used. At home I still had a terminal, linked
with a thin wire to a big machine, which was shared, I
mean, with hundreds of others. And so having my own computer, with a
bitmapped screen, you know, able to do much more flexible text editing, and
forms, and graphics, and all that. That was really for me a revelation.
And I decided that I wouldn’t want to program with these
old dinosaurs any more and I had to have one of these things too.
But they were not on sale. They couldn’t be bought. And the only thing I could do was to decide to build one
myself. And that’s how I diversified into
hardware. Fortunately I had been trained as an
electrical engineer, and so it was a bit easier. But in the meantime, which was something
like 15 years, electronics had undergone a big change, you know, from,
I was still trained on vacuum tubes and now they were to, not only to
transistors but to integrated chips. But it was really fascinating. And with
very little money, I think I got about 50,000 francs as a starting capital, we built a little workshop and built
prototypes. And they were of course then tuned to Modula. The language Modula,
and the compilers, and the operating systems, were closely
connected to, to the Lilith computer, it was called, and the whole thing
was in a way repeated ten years later with a pair of Oberon language, Oberon operating system, and the Ceres
computer. [MUSIC]. >> I felt that we should apply the same
principles of simplicity and well structuredness that we used in
software also to hardware. And this is now possible because of these
FPGAs. And so I got a Xilinx development tool
with an FPGA in it. I implemented a processor, I called it RISC, but it is much simpler than the ARM
or the MIPS or the SPARC. Really again concentrating on what is
essential and presentable to students. The processor’s very low code takes
about three pages, And then, and I’m just about finishing
this, I wrote an Oberon compiler for that
RISC architecture including compiler and linker
and downloader. And so now we start, or I start,
revising the book, completely rewriting the chapters on the
compiler and the linker load. But the rest remains surprisingly as it
was with relatively few changes, which shows that the ideas there
were quite modern. >> Do you have any FPGAs that are
programmed? >> Yes, this one is it, unplug it. That’s just a Xilinx development board
which costs about $100. >> And this is your microprocessor? I mean-

This is the FPGA. >> Right. >> Into which I have downloaded the the RISC processor. >> And so when you’re finished downloading
that, it becomes the microprocessor? >> That’s right, exactly. >> And how is memory come from? Oh, it has memory. >> These two chips are one megabyte. So by nowadays standards, of course,
it’s a small computer, but we can easily fit our Oberon
operating system. >> We understand what a FPGA is and then you’re building up, effectively,
from gates. >> Right, right. That’s the idea for teaching. And in the new version of the book there
will also be a chapter on this. Has to be, to make it complete. [MUSIC]

Video: Bonus: Office Hours Barcelona

  • Setting: This is Dr. Chuck Severance’s second “Office Hours” session in Barcelona for students taking his online courses.
  • Student Introductions:
    • Esha: Planning to start the Internet History course.
    • Hetish: Teasingly prefers Python over Petri (likely a course-specific reference).
    • Isabel: Working on the Python Specialization Capstone.
    • Christina: Linguist encouraging others to learn Python.
    • Enrico: Java programmer learning Python.
    • Marca: Enjoying the Python Specialization.
    • Shane: Excited to meet Dr. Chuck in person.
    • Theresa: Found the Python courses valuable.
  • Positive Atmosphere: Students express enthusiasm about the courses and appreciation for their instructor.
  • Key Takeaway: The video highlights the sense of community and personal connection fostered by these in-person sessions in addition to online learning.

Okay, hello everybody. This is the second Office Hours in
Barcelona given that I lost the video for the first Office Hours but
thankfully I was here another day. So we had another Office Hour. A few people came to both of them and
then we have a bunch of new people. So I’d like you to meet the people,
your fellow students from the class, okay? And you can say hi even though
you’re not in the class yet. >> Hi, I’m Esha and
I’m planning on starting the Internet History class

Okay, I’ll give you an Internet history sticker,
you don’t even have to finish. >> Hi, I’m Hetish and
Python is a lot cooler than Petri. >> [LAUGH] No, no, no, no, no. >> Hi-
I’m going to cut that part out. >> Hi, my name is Isabel and
I’m taking the Python Specialization and I will finish the Capstone. >> Yes. >> Hi, my name is Christina,
I’m a linguist, and I encourage all linguists to learn Python. >> Great. >> Hi, my name is Enrico, I’m a Java
programmer, I’m learning Python. Thank you. >> Thank you. >> Hi, my name is Marca,
I’m taking the Python Specialization, it’s very interesting and
I highly recommend it to everybody. >> Okay. >> Hi, I’m Shane and
I’m privileged to meet Dear Mister Chalk. >> I’m privileged to meet you. >> [LAUGH] Okay, and I think all
of you folks are envious about us because we had a very interesting
companionship with our dear teacher. >> Well thank you. >> Hello I’m Theresa. I was enjoying the Python courses and
I think that you can learn a lot with it. >> Thank you, thank you. Well, so there we go. Another successful
Office Hours with Cavesa and wonderful students here in Barcelona. So cheers. See you next time. [MUSIC] It is like a pigeon. It’s exactly like a pigeon. Okay. >> [LAUGH]
So here we go. So I’ll put this close to your face but
it’s a GoPro so it’s got a real wide angle. So it’s not only taking
a picture of your nose but it gives really good audio that way. >> Okay. >> Okay? So I’ll start. >> That’s the pro’s mistake. >> [LAUGH]
It is. [MUSIC]