Skip to content
Home » University of Michigan » Python for Everybody Specialization » Using Databases with Python » Week 2: Basic Structured Query Language

Week 2: Basic Structured Query Language

We learn the four core CRUD operations (Create, Read, Update, and Delete) to manage data stored in a database.


Lectures


Video: 15.1 Relational Databases

Why Databases Matter

  • Efficiency: Databases allow blazing-fast retrieval of data from massive stores, unlike older tape-based systems. This is crucial for modern applications like online banking.
  • Relational Model: Databases organize data intelligently for fast access, avoiding the need to read sequentially like on a tape. This innovation (largely from the 60s and 70s) powers companies like Oracle!
  • Abstraction: Complexities of data storage and retrieval are hidden beneath database software layers. This makes development easier.

Database Concepts

  • Terminology: There are fancy terms (relation, tuple, attribute) and more common programmer terms (table, row, column). Don’t be intimidated by the jargon.
  • Schemas: Like labeling columns in a spreadsheet, databases have strict rules about data types and formats in each column. This enforces consistency.
  • The Problem with Spreadsheets: While they can work for simple data, spreadsheets become unwieldy for large or complex data, especially those with lots of repeated information.

Introduction to SQL

  • SQL (Structured Query Language): The standard language for communicating with databases, allowing portability across database systems (Oracle, MySQL, etc.)
  • Beauty of SQL: A powerful and elegant language, great for working with well-structured data.
  • CRUD: The core database functions SQL handles are Create, Read, Update, and Delete.

Why Python + SQL is Powerful

  • Python: Handles messy, unstructured data with ease. Cleans data up for the database.
  • SQL: Excels at storing, retrieving, and working with clean, organized data.

Next Up: The focus shifts to practical interaction with databases, primarily in the context of smaller projects.

So now we’re going to move
into talking about databases. And it may seem a little strange to be
talking about databases in a programming language, but we are using this
programming language for doing data analysis. And so it turns out that it’s quite often
necessary when you’re pulling data, especially data you’re pulling over a
network, where you might be rate limited, or you want to store the data in a database. You have a process that reads the data and
puts it in a database, and then you can analyse it
out of the database. And it makes some of your
processes go really fast. It allows you to change your analysis, and not lock your analysis into
the retrieval of the information. So it speeds up your overall workflow,
but you have to learn a bunch of stuff so that you can put the information
into your database. Now, the first task that you have is
to install your database browser. There’s a couple of different ways
to read and write these files. We’re going to use Python to read and
write the files, but we want a way to read and
write the files directly. And so this is code that you can download
for the Mac, for Windows, for Linux. There’s also a Chrome plugin
that you can get for SQLite, that you could do everything
that we’re doing in the class. And so I don’t really care how you achieve
the things you’re going to achieve in the class, because we’re going to grade you,
not so much about what, how you use things. But instead, we’re going to look at, you know,
the databases that you produce, okay? So stop now, download this, install it. It should be simple, should be easy. Check with the forums if
you have some problems. So relational databases are a whole
subfield of computer science. And it might be best to talk about what
life was like before relational databases, and you use relational databases thousands
of times a day without even knowing it. If you imagine something like
a learning management system, where there’s hundreds of thousands of
users, and terabytes or petabytes of data, and you log in, and within a half a second
it shows you what you’re supposed to see. You can’t read a petabyte of data in a
half a second, you just can’t. And in the early days,
we used to have data that was small and computers that didn’t have
a lot of storage in them. So we tended to use tapes and we would put the data like on a magnetic
tape, and we would sort the data. We would have old bank balances on one
tape, we would have the transactions, and then we would read one balance, check
to see if that transaction had changed, then we would store the balance. And you would have like last night’s bank
balances, and tonight’s bank balances. And then the next day,
you would put this over here, and read the old bank balances and
make changes. And this is how, day after day, in the 60s
and the 70s, we would update data. If you look at old computer videos,
you’ll see these little spinning things. Well those are tape drives, and that’s where most of the real
data was stored, and it was. But as computers got faster,
as they got more memory, and as we started to store data more
in the memory of computers and on the disk drives a whole
different way, so you didn’t have to read to get through the
account where it starts with the letter s. You didn’t have to read through all
the data from a through s just to get to my account that starts with s. And you’d be like oh, wait a second. We have a disk drive that we can skip all
the way and skip back and skip over here. And so the problem became how to make
sure of this random access medium in which we can store data in a way
that’s efficient and fast and clever. So just because you could randomly
access the data didn’t mean it was fast. You still had all the data to look at and
if you just looked at it sequentially, so relational databases were
this applying cleverness to how we would use random access data
storage, mostly disk drives that spun. And it really sort of
emerged in the 60s and the 70s, and whole companies were
formed like Oracle. Oracle exists because smart
people figured this problem out before other smart people did. And so they got to form a company and Oracle’s the leading database vendor
on the planet and much of its revenue comes from its database product and
things built on top of its database. But literally before 1960, the concept
of database really wasn’t an idea. It was something that happened as storage
and what we wanted to do with computers. So this is where, if you had to really
read a tape to log in to a computer it might take four hours to log in. Which would be unacceptable
in this modern day and age. As I mentioned, the database is sort
of this technology that emerged and it emerged from a lot of
theoretical analysis and the underlying foundations of database have to
do with some really powerful mathematics. And so there’s powerful mathematics and
it still is present in some of the terminology that some
people use to describe databases. And so there’s really kind of two parallel
terminologies that you will encounter. You’ll pick a book up and you’ll kind
of be able to read it and say oh, this is using the highfalutin hoity-toity
language, which is the more math-oriented. So in the more math-oriented, we use
the words relation, tuple, and attribute. That’s kind of the fancy
way of speaking about it. But sort of we programmers
who just do our thing, we would call it a table,
a row, and a column. Now, table and row and column is kind
of the wrong way to think about it if you’re trying to understand the true
underlying amazing mathematics. The underlying mathematics, don’t worry
about the underlying mathematics. Just be aware as you’re reading, don’t be surprised when people lapse
into this more fancy nomenclature. So the idea is that you model data at
a connection point rather than like, here’s data and we’re starting here and
we’re reading through it. The idea is if you model everything
as a connection, like who a person is. A person is a connection between
this, that, and the other thing. And so this notion of modeling
stuff at a connection is the underlying math that
makes databases fast, but when we programmers think about it we kind
of think about it as rows and columns. And so here’s just a screenshot
of a spreadsheet that I made. Along the bottom you see the names
of the various subsheets, tracks, albums, artists,
genres, and titles. And those are like the database tables. And then each table, if you select it,
has a row and a column. So it’s got these columns. This has three columns in it and
it has a bunch of rows. The other thing,
when you’re doing stuff in a spreadsheet. You just kind of come up with a
data model, a schema, a strategy so that you know that the first column is
always the title, the second column is the rating, because if you didn’t
label it, it would make no sense. So we often do something where
the first row of a spreadsheet is kind of metadata about the columns,
right? Now, the spreadsheet doesn’t
exactly know that you’re doing it. Even though some of the things you do,
like sometimes it sorts and says, oh, check this tickbox to say the first
row is title, so don’t sort it. So when you sort stuff you only sort
like the bottom part of it, right? And so it sort of knows it’s the title. But in a database this becomes
what we call the schema. All this, all these titles and we have rules like these have
to be integer numbers, this has to be an integer number, this has to
be string no more than 128 characters. So we make strong contracts about the
content that sort of at some level feels very similar to what we´re doing
when we´re writing these titles. But in reality, as you’ll see, it is far
more complex and far more intricate. Now, if you’ve ever tried to do something
like we’re going to do with this database, and that is categorize your music
collection, you find that a spreadsheet is a really hard way to categorize
your music collection because of so much repeated data. And we’ll figure that out in databases. So the idea that they came
up with in the 60s and the 70s is there was a way
to represent data on a disk. And it could be randomly accessed and there was pointers that pointed
to something else, and whatever. And in the early days they figured out
how to technically solve fast access to lots of data by hopping in various
ways, indexing, doing clever things. And the early ways that we programmed these
databases is we sort of revealed to us application programmers, the low-level
capabilities of read this thing, then jump to this other thing,
then read this other thing, then jump to the other thing,
read this thing, jump to this thing. Which made our lives really difficult. And so we had to write really
sophisticated programs, but when we did they could
function very rapidly. So what happened over time is instead of
our code talking directly to the files or the databases, instead, a layer started to build up that we call the database application,
database. I’m not doing a very good job drawing,
right? So here’s our database, and
it has lots of complexity in it. And originally, we sort of just talked
straight to that complexity, but after a while we would have
a database application. And then our application, our code, would
talk to the database application, and the database would know
all the magic stuff. And it was what we call in
computer science an abstraction, which means that our job, this is us, and our job was easier because
the complexity could all be hidden here. And so the question becomes, then how do we talk to this really powerful
piece of software that we call a database? What is living here? What is the way we communicate? Sometimes we would call this an API,
application program interface. So this is our application. This is actually another application,
the database itself is an application. And so we have to have an interface so that our application can talk
to the other application. And just like in the previous
lecture this is like a service. It’s taking care of it. And ultimately,
let me change the color here. It’s getting kind of messy, we can sort
of think of this thing right here, both the data and the software that
understands the shape of the data, as just a service in
a service oriented architecture. So once again, what we have to
do is define, to use the service oriented architecture, the cut point. What is the cut point between our
application and this magic stuff? And this is Oracle. Right? So it’s lots and lots of money. And it’s amazing, it’s just really
good software. Right? Oracle is really good. Microsoft is really good. So what happened was, we decided to
create a standard at this point. The industry did. The folks got together with the National
Institute of Standards and Technology, NIST, and they agreed on a language that was the API between
an application and a database system. And the name of that thing
they came up with was SQL, or the Structured Query Language, okay? And so that means that, wow,
I’ve made such a mess of this, I have to start over again and
draw another picture. Right? So that means that over
here you could have Oracle. Over here you could have Microsoft. And here you could have APP1. And here you could have APP2. And APP1 could talk to Oracle,
or it could talk to MySQL. Or APP2 could talk to Oracle,
or APP2 could talk to MySQL. And so because they use the same
communication between Oracle and MySQL, that means that you can
write an app that’s portable between different databases. And, it turns out that this
SQL is a beautiful language, it’s just a gorgeous language. The reason I don’t teach SQL as the very
first programming language is you would be ruined. If you learned SQL you would never
want to program in any other language. I think it’s the most beautiful
language that I’ve ever encountered. It’s beautiful because of its
simplicity and its expressiveness, and its power, but then it’s a very
beautiful and elegant language. The problem with SQL is,
it depends on the data being pretty. It’s a great way to write code
about really gorgeous data. So part of what we’re going to do is learn
how to make the data look really good, and then write really cool stuff. Python, on the other hand,
is a little rough around the edges. But it also has no problem dealing
with unstructured data, or data that is imperfect, or whatever. And you can just keep writing more Python. And eventually you’re like oh man I
thought it wasn’t as complex at this. I got to add a little bit
of if statement here and add another if statement there and
do another thing. So Python handles unstructured rough
data much better than databases and that’s why Python plus SQL
is such a powerful thing. Python cleans up the data. SQL is a great way to store and
retrieve data. So what SQL does is it
has four basic functions. We call it CRUD. Create, read, update, and delete. Update is missing here. CRUD. So that’s a database term. It’s CRUD.
Create, read, update, and delete. So up next, we’re going to talk about how you
as the application developer are going to interact, both in large projects and
in small projects. Small projects are going to be the more
typical thing that we use in this class. So that’s what we’re going
to talk about in a bit, and how you use software to
interact with databases.

Video: 15.2 – Using Databases

Roles in Database Systems

  • Large Projects:
    • Developer: Writes the application code, interacting with the database via SQL.
    • Database Administrator (DBA): Directly manages the database, ensuring performance and security.
  • Smaller Projects (like in this course): You’ll wear both hats – developer and DBA.

Database Software

  • DBA Tools: Software like SQLite Browser lets you interact directly with the database, creating tables, and entering data. This is what we’ll use initially.
  • Application Code: Later, you’ll write Python programs that:
    • Acquire and clean data
    • Store the cleaned data in a database
    • Write other Python programs to analyze data from the database

Database Types

  • Commercial:
    • Oracle: Expensive, powerful, industry leader.
    • Microsoft SQL Server: Popular in Microsoft-centric environments.
  • Open Source:
    • MySQL: Fast and simple, often used for websites.
    • MariaDB: Fork of MySQL, ensuring open-source status.
    • PostgreSQL: More feature-rich, competes directly with Oracle.

Why SQLite?

  • Embedded: SQLite is designed to be part of an application, not a separate server.
  • Built-in to Python: Makes it super easy to get started with databases, no extra installation.
  • Great for Learning: While it’s designed for smaller datasets, it teaches all the core SQL concepts that apply to those larger database systems.

So now we want to talk about
how we place the database and the database software in
relative to the other aspects of the programming environment
that we’re going to build. And we’re going to talk both about
large projects and small projects. And a large project would
use sort of a web site. Like, you know maybe you build a web site to
track comic books or something, and you store all the data in the database. And so, in a large project, and we’re
not going to do it in this, so much in this class, but I figured it’s time now that
we’re learning about databases to sort of give you a sense of how they’re
used in these online systems. And maybe in a future course
you’ll see how they’re used, and learn how to do
this in large projects. But there are generally sort of two roles. One is the person who
writes the application and produces the user interface for
the application, and for the web site. And then there’s the other that actually
sort of monitors and adjusts the database, and that person is
the database administrator. Often in a large project, you will get some advice as an application
developer from a database administrator. And the idea is is that
database tuning and monitoring is a somewhat specialized job. And so in a professional environment, like the open source learning management
system that I work on named Sakai, in production I don’t get as
a developer, and so that’s who I am. This is me, I’m a developer. I write cool code, I make really
neat widgets for the end user, and this is the end user in a web browser,
talking to my cool thing, and I add a little button and
stuff like that. And I write code that actually talks
to the database, the database server, which is both code plus
the data storage itself. And I write code that talks SQL to the
database server and gets data back, and then formats it for the end user,
and that’s what developers do. They’re the ones that create
application software. The database administrator, generally
we don’t let the developer talk to the production database directly, because
we’re afraid the developer will break it. And that’s probably because there’s
been a lot of experience where developers broke it. So, we have a special person
called the database administrator. And they’re the ones that
are allowed to talk to the database. Now, because the database is both storage and software, it can take SQL
queries from many different sources. Both the application, or database tools. And so the database administrator has like
a direct way to talk to the database and get data back. And so if something’s going wrong, like a,
it’s gotten really large, what’s going on? So the database administrator will start
running commands straight to the database. Whereas the developer generally
has no access to the real database, Eventually we can write a new version
of the application software. And so that’s kind of the split between
the roles between developer and database administrator in large projects. And the database administrator helps
with the design and implementation. And often a database
administrator is shared across a number of software projects. But that’s not exactly what
we’re going to do, right? So it’d be nice, and maybe you have a job. Maybe you do have a job that there’s
a database administrator, and it’ll kind of look like that. But right now, during this class,
we’ve met the database administrator and the application
developer and they’re both you. Okay?
They’re both you. Or me. Right? And so what’s going to happen now
is you get to do both roles. And later, if you have somebody
to help you with it, you might turn more into
the developer side, or more into the database administrator
side, that’s perfectly fine. So we still have a database system, right? And that’s got both the logic, the
software that makes the database smart, and the data itself, and we talk to it. And the program we’re going
to use initially right now, is we’re going to study
the DBA way of doing business. You’re going to write SQL, you’re going
to talk to this little piece of software that’s going to submit stuff directly. And so, you are going to act
as a database administrator. We’re going to create tables,
put data in them and we’re just going to type stuff into
this application called SQLite browser. There are others of these applications
that are like database administrator applications. Much later, much later we are going
to start writing Python programs. And in those programs we’re going to talk
to the same database files using SQL. And we’re not going to write
programs that have a web interface. That would be like Twitter or
Comicbookregistry.com or whatever. We’re going to have various
sources of inputs. Sometimes files, sometimes network, and we’re going to write programs
that read this stuff. And clean it up. Often there’s a cleaning step. And then we are going to write
the data into a database. And so we’re going to grow this data,
fill it up, and then we’ll sort of be done with this phase of acquiring data, and all
our data will be sitting in a database. Then we will write another Python
program to read this data and maybe make a pretty file. And even later in the class,
we will be visualizing that or sending it to various other places. You might send it to R to
do statistical analysis, or Excel to do some simple analysis or
visualization, or we’ll use JavaScript to make pretty
pictures and spinning things and whatever. So, the basic pattern
here is it’s the same. We’re going to have sort of apps and
then we’re going to have DBA. But they’re both you. So the thing we’re going to
focus on first is right here. The database we’re going to use is a thing
called SQLite, and the software we’re going to use is what I told you to install
right at the beginning, a while ago. Hopefully you’ve done that, SQLite browser,
because otherwise you can’t talk to the database, unless you
have something equivalent to this. And they do exist. But the basic pattern is,
read data, clean it up, stick it in the database in a sensible and
organized way. And then write other applications to
make sense of it and analyze the data. And that’s what we’re
going to be doing coming up. So, when we build this data,
I mentioned the word schema, right? That we come up with a contract for exactly the way the data
is supposed to look. And with this contract, the database software can make
the best use of itself, right? It can make good use of the disk drive. It can make good use of
all these other things. And so, we call this the database model. And it’s best thought of as a contract. Okay, we’re going to talk into this file. There’s going to be magic things
that are stored in here. We use this database software. And the schema is our way to inform the database how we think this,
how we’re going to use the data and let the database figure out the way
to play it, where to put it. I’ll put this thing over here,
I’ll put another thing over here, and I’ll put a little shortcut over here. We don’t care about that. We say this is the data model we’re going
to work with, and you the database code are supposed to be really smart and figure
out how to make the most sense of that. So the data model is not just the SQL
to query it, but the data model is our contract about the shape of what we’re
going to store and retrieve in our database. So there’s a number of common database
systems, I’ve mentioned several of them. Oracle is awesome. It’s a company that was effectively
created just to make a database system. And they were founded at the birth
of the relational movement, and won because they were very good at it. The others who didn’t win were
like IBM, and others, although they all had relational
databases as well. A lot of those other vendors who let, well I don’t think
letting Oracle win is the right word. Those vendors that Oracle out-competed were the ones that
didn’t embrace this relational stuff. Because I was in college when they
saw what was coming down and some of us were like, ah,
relational, too much math. Give me the quick, dirty stuff, right? And it took a while for
the relational model really to win, but when it won, it won big time. And Oracle was a company that
made a bet on it winning. And so by the time we all realized,
whoa, it’s a good idea, because we all thought it was a bad idea,
or at least I did, it was too late. Oracle had the best implementation
of the good idea that we all finally realized was a good idea. And so, Oracle kind of wins. So then what happens is Oracle’s kind
of the old dog in the marketplace, the very expensive, commercial, a lot
of people think it’s simply the best. But there’s alternatives
that are quite nice. There’s MySql, which is a open source. MySql has been purchased by Oracle,
but continues to be open source. If you really want to be open source,
there’s a MySql MariaDB. MariaDB, it’s called. It starts with an M on purpose so that it starts with
the same letter as that. So MariaDB is a copy of the open
source version of MySql which was made at the moment
that MySql was purchased by But MySql remains open source so
a lot of people who use MySql are switching to Maria, but
a lot of people just keep using MySql. And if Oracle does something bad with
MySql, we’ll all switch to Maria and that’ll be the end of that. So at that point, Oracle hasn’t done
anything really bad with MySql, and so it’s okay. So we tend to use MySql for online
web sites, a lot of web site stuff. Oracle tends to be used for
enterprise kinds of things. And then another one that’s very popular
is SqlServer, it’s the Microsoft one. You may have even used
Microsoft Access and you might have even done like a screen,
where you’re dropping things down that turn out to be SQL under the
covers, don’t know. SqlServer for
Microsoft is a very good product. Some enterprise organizations are all
Microsoft and so they use SqlServer. Don’t feel bad, pretty much
everything I’m going to talk about applies equally to SQL because
SQL’s a standard and that’s nice. There’s another one called
Postgres that’s also open source that’s very nice and it’s kind of
more of an imitation of Oracle than MySql. MySql is designed to be simpler and
faster and solve less problems than Oracle,
but do them super fast. And that’s why it’s often
very popular in web sites. Postgres is sort of
more of a rich database that’s very much a direct
competitor of Oracle. MySql is its kind of own world. And we’re not going to use any of these. If you go take sort of a web class,
you will probably use MySql and everything I’m going to teach you
in this class is useful there. Okay? All the knowledge, all the SQL. But we’re going to use a thing
called SQLite. Lite meaning little, tiny, not very heavy. And so, SQLite is what’s
called an embedded database. And MySql or Oracle are whole
pieces of separate software that we sort of talk across
a network connection to. But SQLite is actually part of software,
and so it’s built in. Your car, if it’s a reasonably modern car, probably has five copies of SQLite
in it, right? So your car has SQLite. If you have a fancy music player in
your car, it probably is using it. If you have an iPhone, it probably
has 30 copies of SQLite in it, because SQLite is designed to be really small,
so small that you actually don’t have to separate it, you just build
it into the application, and the thing that matters most to us
is that it’s built into Python. It’s already there in Python. So that’s super awesome, right? And it’s fast and
aimed at smaller amounts of data, but because in Python we’re doing data
analysis, we will make good use of SQLite. And the fact that it’s built in
just makes our life really simple. We don’t have to install
a bunch of software. Python has it with nothing
more than an import statement. You know how this story goes. If you’re in Python and
you’ve got to get something done, the first thing you do is figure
out what import statement you need. Okay? So that gives us a sense the of the SQL
applications that we’ll be using. Up next we’re really going to
start writing some SQL. Make a database and put some tables in it,
and put some data in those tables.

Video: 15.3 – Single Table CRUD

Creating a Database

  • SQLite Browser: This is the software used to interact with the database file. Other tools could also be used.
  • Importance of a Schema: Before creating a table, you define the columns and their data types (like name with VARCHAR). This creates a contract for fast, efficient data operations.

Core SQL Commands (CRUD)

  • CREATE TABLE: Defines the table structure (example: CREATE TABLE Users (name VARCHAR(128), email VARCHAR(128)))
  • INSERT INTO: Adds a row of data (example: INSERT INTO Users (name, email) VALUES ('Sally', 'sally@umich.edu'))
  • DELETE FROM: Removes rows matching a condition (example: DELETE FROM Users WHERE email='ted@umich.edu')
  • UPDATE … SET: Changes data in specific cells (example: UPDATE Users SET name='Charles' WHERE email='csev@umich.edu')
  • SELECT … FROM: Retrieves data, potentially using WHERE for filtering and ORDER BY for sorting. (example: SELECT * FROM Users ORDER BY email)

Why SQL?

  • Intuitive: The language is designed to resemble conversational English.
  • Powerful: While seemingly simple, SQL enables complex data manipulation and leverages the database’s underlying optimizations for speed.

What’s Next

The real power of databases comes from modeling relationships between tables. That’s the focus of upcoming lessons!

So now we’re going to get
to the heart of the matter, we’re going to actually write some SQL
and we’re going to make a database. Okay, you hopefully by now, I’ve told you a couple of times, installed
the SQLite browser. If you haven’t, do it. You could also write a Firefox plug-in that can be used to manipulate
SQLite databases. I used this before SQLite browser
came out with their newest version. So, everything I’m going to do
could be used by either one. And there might even be
a way to do it with an SQLite admin tool if you downloaded
like a MAMP or an XAMPP. So it doesn’t matter. You just need some piece of software
that gives you a place to type SQL, You hit a go button, and it sends it
to the database and comes back. That’s all you need. Okay. This is what it looks like,
SQLite, SQLite browser. So let me go ahead and start
this thing up. SQLite browser, and there we are. Database Browser for SQLite. And what it basically does is this
is going to read and write a file. And that file that it’s going to read and
write is in a very specialized format that’s highly optimized for
rapid retrieval and rapid storage of data. And so we’ll start by saying New Database and I’m going to pick a place on the disk. I will put it in the desktop and
python for everybody, databases and I’ll call it sql1. That’ll just be the name of the database. Now, it’s writing to a file. This file is not something you should read
using anything other than a SQLite library. It’s a funky looking file. So
we’re going to save it, boop. And it’s asking us if we
want to create a table. I’ll just hide that thing. So we have no tables, no indexes,
no views, no triggers, no nothing. And we have a place that
we can execute SQL. So when I said you’re
a database administrator, we got a file back there called sql1 that
everything we do is actually reading and writing to that file, but we don’t
write code to touch that file directly. We always are using the SQLite library,
okay? So we’re up, let’s create a table. Now this is our first SQL. And so CREATE TABLE and
then the name of the table, Users. I’m just going to use upper case. Upper lower case generally matters. There are a few weird situations
where it doesn’t matter, so just pretend that case
matters in this situation. And it teaches you something
right away about SQL. SQL is designed to be
somewhat conversational. And so the keyword, CREATE TABLE, is kind of like one keyword,
we’re creating a table. Then the name of the thing and then
a parenthesized list of the fields and then the types of the field,
that’s what we’re dealing with. So this is CREATE TABLE named Users, and
we’re going to put two columns into it, and one is of variable length
character, up to 128 characters, an email that’s also a variable
length character up to 128 characters. So, this is sort of our schema,
this is our contract. We’re saying these are the only two
things we’re putting in this table. We can’t put a thing
called age in the table, we can’t put a thing called
favorite beverage in the table and we can’t put 129 characters
in the name column. We can put 0 or 1 or 2 or 127 or 128, that’s just fine, but we can’t put 129. This is a contract. And the database, if we want to put longer
characters in, we just say no, no, no, no. We want to put 1024 in and then the contract
will be we could put up to 1024 and if we put 1025, the database would
be mad. And like why do you do this? Why is it, why is the database so
obsessed about this? And the answer is, as it lays
the data out on this disk drive and it wants to get to it fast, it really is helpful to know the size,
the maximum size. Because if you say this is only going
to be four, versus four megabytes, it might lay the table out
differently on the disk. You don’t need to know that. You just need to be explicit
about the contract. Okay? So this is part of making the
data model. Figuring out in advance the kind
of data that you’re going to use, how you’re going to use that data,
and then you communicate that to SQL. And then you don’t worry about all
the magic that makes that super fast. So this is the contract that you make so that your data is really fast. By
being very explicit about your data, you get in return the database does
a lot for you and does it really fast. So now I’m going to go into
my little SQL window and I’ve pasted in that
CREATE TABLE statement. And different user interfaces
will have different things, but they all kind of have a window. And you all put some database stuff in
them, and you hit this execute button. It’s a this little play button in this one,
might be something else in yours, but at some point you go, do it! And down here it says query executed
successfully, CREATE TABLE, yadda, yadda, yadda. So now if we take a look at the data,
we see that we have one table and the table is named Users and
there’s name and email. If we pull this over we’ll
see that the VARCHAR is 128, so the database knows the contract
that we just gave it, okay? So that’s your first SQL command. Your next SQL command is, and you probably should just copy and paste this
stuff off the slides, it’s what I’m doing. So there’s your table,
we’re going to put some data into it. So let’s go ahead and do that. And we can put a new record in there. We can say Chuck csev@umich.edu. We can say another New Record. We can say Colleen. We’ll call her vlt@umich.edu. I want it so they sort right, edu. Another record, we’ll make this be Sally, sally@umich.edu. And then we’ll have another
one we’ll call like Fred. I don’t know, Fred. fred@umich.edu. Now it turns out that everything we’re
doing is actually running SQL, and I want to finish that. And if you take a look
here at the SQL log, you can see all the SQL
that’s going on right here. This INSERT, but let’s kind of
leave that alone for now, we’ll just kind of hide that. But that’s what’s going on right now. And so we made a table and
it now has four rows in it. And now we’re going to do some SQL to We’ve done the create and now we’re going
to do some inserting and deleting and retrieving of the data. So the first thing we’re going to
do is insert another record. Now, what we were just doing in that
user interface was exactly this. Here is another example of
some SQL where INSERT INTO is kind of like a single keyword. The name of the table,
a parenthesized list of the columns and then the word VALUES,
that’s just part of SQL, and then a parenthesized list of
the corresponding values. So Kristin is the name and
kf@umich.edu is the email address. And so that is a SQL statement that
inserts a new record into our Users table. So we’ll just cheat and grab this, copy it, then go to the SQLite browser and
we’ll go to Execute SQL. And I will get rid of the CREATE TABLE
because we already did that, and just type in insert users
with that parentheses. It’s doing syntax coloring for
me which makes it very, very helpful. And then I got the right SQL. Now let’s just say I make a mistake. Let’s just say I call this
ZAP instead of VALUES. I will run this and it will go
like zap is a syntax error and so it does talk to you and then you can
go fix that and say VALUES. And it’s actually helping
me and I can run that. And then it says, success. And I go back and
look at my data structure. I can browse the data, and
there’s Kristin, right? So the INSERT added another row. Pretty simple stuff, isn’t it? Okay, so there we go with that. We inserted another row
with Kristin in it. Now we could delete a row. And so we say DELETE FROM,
which is again kind of like a keyword. DELETE, it could be DELETE Users, but it’s sort of like, we’re not actually
deleting all the users. We’re deleting something from the Users. And so they add this keyword in
SQL that’s kind of redundant. It’s not necessary, but it lets us
read it a little. DELETE FROM Users. Well that sounds like words. Sounds nice. DELETE FROM Users,
the name of the table, and then a WHERE clause. And you can put WHERE clauses on
lots of different SQL statements. And that says, don’t delete all the rows,
only delete where this is true. And in this case we put
a logical question. So this is like a question mark
in a if statement, right? DELETE FROM Users WHERE
email equals ted@umich.edu. This could be one row or there could be a hundred rows
that all have that email address. That would actually be bad to have a
hundred rows that have that email address if that were a key, but whatever. So this just says DELETE FROM Users
WHERE email equals ted@umich.edu. So let’s do that. Again, I just copy this, copy. Go back to my SQLite browser. I could use the Delete Record button, but I’m trying to teach you
this SQL language now. So this is the way to do it. DELETE FROM Users WHERE
email equals ted@umich.edu. I hit the play button. And it says executed successfully. And I go back, and I look at the data. And, was there no Ted in there? I’d better delete a different one,
let’s call Fred. Let’s call it fred@umich.edu. So now if we do the Browse Data,
Fred’s gone, so the delete works. That’s why I do cutting and
pasting on this. So that’s a delete. Update allows us to sort of reach in
to a particular cell or set of cells in a table, many rows or many columns or
many row / column combinations. And we just say UPDATE, which is
the SQL keyword, the table name, SET, which is an SQL keyword, and then
the column name and then the new value. The WHERE clause is important
because without the WHERE clause, it would do it for all of the rows,
so the WHERE clause reduces the rows. So we have a WHERE clause here that
says WHERE email=’csev@umich.edu’. So what we’re really saying is go find
the rows that have that true and change the column name to be Charles. Okay?
The column name to be Charles. So let’s do that one. And hopefully at this point you
are sort of following along and you’re doing your own data because that
will be one of your assignments is to make a database and turn it in. So, I will say execute. Oop. and I take a look at Browse Data and
you see that Charles is now changed. So far so good. You see why I really like this
language. I mean, it makes so much more sense than Python,
but don’t criticize Python. Don’t be hating on Python! Python is powerful and flexible. This is powerful, but not so flexible. So we did a insert. We did a delete. We did an update. Now we have to do the read. The R part of CRUD. And so that is SELECT. And SELECT takes a list of columns,
that’s a column list. And star means all columns. FROM is a keyword and
then Users is the name of the database. And you can have either with or
without a WHERE clause. And so this says select. That really says all rows all columns
from the database named Users. SELECT * FROM Users. Okay? And this says this will turn out to
be one row in our little database, but it might be more than
one if this was true. If this WHERE clause was true for
more than one, okay? So, let’s run those commands. Now, it turns out that we’re kind of
implicitly running those commands, the SELECT * FROM Users,
every time I hit that browse button. Whoops, don’t do that. So if I go over here and when
I do browse, I couldn actually see that in SQL log,
the last thing was that. Okay? And so it’s doing a select on our behalf,
but we’re learning how to do SQL so we don’t want to, we want to
do it the hard way. So we say SELECT * FROM Users. Hit that, and
now down here we see that and we also see in this log as the log
goes by, that it sent that stuff. Now it´s also sending more stuff. We´ll get rid of the log,
it´s too complex. But it´s really just SQL. I´m giving you the simple stuff at the
beginning, okay? So that was a select and I can add a WHERE clause,
WHERE email equals csev@umich.edu. Let’s see if I typed that right. And that should show us all the rows,
but only show the ones for which that’s true. So that’s a WHERE clause, And a SELECT. You can also throw another
clause on the end of a SELECT. You can have an ORDER BY. So in this case I’m going to SELECT * FROM
Users, which is exactly what I did before. Select all the columns
from the table Users, ORDER BY, again a two-word keyword that
rolls off the tongue more naturally. email, which is a column. Sorting is one of the things that
databases do really, really well. And that’s because it actually sometimes
for sorting it has these things. And it has these little tricks that say oh that one
goes here and this one’s the next one. And so if you really want it in
sorted order, have a really quick way
to figure that stuff out. But that magic, you don’t worry about. That’s the database’s problem. So you just say hey, smart database
from billion-dollar company, I would like to order this by email. I don’t know how to do that, I don’t
even know how to write a sort algorithm. I just know that I want it ordered
by email. Now do it, database. And there you go and you see these things
ordered by email and if I want to change it to be order
by name, it’s order by name, okay? So name this time, email that time. Turns out sometimes you can
just click on this header. But we’re learning the tough way. Which is actually not all that tough. So you can select and you can order them,
you can have a WHERE clause and ultimately, this is the summary
of what I’ve taught you so far. An INSERT INTO a table
with a columns VALUES, a DELETE FROM Users WHERE, and
a true / false with UPDATE Users SET, SELECT * FROM Users, or
SELECT * with an ORDER BY. So at this point, you’d be like, wow,
I needed to go to college to learn this? It’s like so easy [LAUGH]. It is easy. That’s why I don’t teach
you SQL until later, so you learn the hard stuff, so you’re tough. Now I can show you the easy stuff. Now, of course there’s
some complexity to come. But at the end of the day when we’re
talking about a single table and the statements that are one table, it looks like a really big
fast smart spreadsheet. And that’s why some of you may
recognize what you’re learning here. You may say, you know what,
I’ve actually been doing this! And I didn’t realize how smart I was! I didn’t realize I was doing databases. But we’re still just at the basics. The power comes with what we’re going to talk
about next and that’s when we have more than one table and we start modeling
data at the connections between things. And that’s when we really start to show
how you can make things really fast. When we exploit the relationships
between tables. So that’s what’s up next.

Video: Worked Example: Counting Email in a Database

Purpose of the Code

The code demonstrates how to:

  • Connect to a SQLite database from Python
  • Create a table within the database
  • Insert data into the table
  • Update existing data in the table
  • Retrieve sorted data from the table

Key Database Operations

  • Create Table: Uses the CREATE TABLE SQL command to define the structure (email and count columns). The DROP TABLE IF EXISTS ensures a fresh start each run.
  • Insert: The INSERT command adds new rows of data (email addresses and counts) if they don’t already exist in the table.
  • Update: The UPDATE command atomically increments the count for an existing email address. This is more robust than reading, incrementing, and re-writing, especially with potential concurrency.
  • Select: Uses SELECT and ORDER BY to retrieve the top email counts in descending order.

Database vs. Dictionary Analogy

The code starts by illustrating how the database functions similarly to a Python dictionary. It stores key-value pairs (email and count), making it easy to keep track of email frequencies.

Other Important Notes

  • SQL Injection Prevention: The code uses placeholders (?) and tuples to provide data to SQL commands. This helps protect against SQL injection attacks.
  • conn.commit(): This command writes pending changes from memory to disk, ensuring data persistence in the database.

Hello and welcome to a Code Walk Through. In this bit of code, we’re talking about emaildb.py. This is a beautiful little example in that it sort of reduces talking to the database
to kind of its pure essence. And so we’ll start out this code and we import the
sqlite3 just to get the library there. We make a connection and that, in databases, we sort of end up with
an open that’s two steps. That there’s the connection to the database
which checks access to the file and the cursor is kind of like our handle. It’s not as simple as you just open it and read it, but you open it and then you send SQL commands the cursor and then you get your
responses through that same cursor. So cur here is the variable that we’re interested in. And the first thing that we’re going to do, is we’re going to. we’ve got this file, it will either create this file and
right now this file doesn’t exist. It’s going to be in the same directory. There’s no emaildb so this is actually
going to create the file when it runs. And then the first thing we’re going to do
is drop the table if it exists, drop table is a bit of SQL. If exist just keeps this from blowing
up if we start with a fresh database and in this case there is no file there so we are starting
with a fresh database. This will accomplish absolutely nothing, which is just fine. Now we’re using triple quotes here. I’m just kind of using that to make
this a little bit easier to read. I could pull those lines up a bit. This one’s actually small enough that I could, maybe I’ll just do that. Let’s do that. Let’s bring that baby right up and
turn this into a single quote. That’s short enough, but triple code is just this one here’s a
little longer so I’ll use triple quote. I’m going to drop table. That’s going to do
nothing first time through, then we’re going to do a create table. Now sometimes your application will have
like a read me or something that says, go run these commands to set the database up
but we’re able to just set this database up
in this particular application. We’ll see later ones where we’re going to leave the database and not start it fresh
and in this one, we can do the same. In this one, we could but we’re just going to start fresh by
dropping the table so we’ll create it. We’re going to have email, an account. Basically, what we’re doing here is we’re really going to pretend that this is a dictionary. If you recall when I said dictionary, a dictionary is like an in-memory database. Well, now we’re using a database to do a database but the first thing we’re going
to do here is pretend it’s a dictionary. So these next lines of code, hopefully are pretty familiar to you, right? You get a file name, loop through it, check to see if it’s, grab mbox-short by default so we can press the enter key,
and then loop through it. And so this little part right here, this is our basic loop that we’re doing
and so that is pretty normal. And if we look at this line right here, that line right there makes sure that
we can only get the From lines, we’ve done that a bunch of times
and we’re going to split it. We’re not going to strip the right
because the split is going to take care of that and then we’re going to grab the email address,
which of course, in the From line is the second part
and then we will have that. So now we’re going to do some database. The first thing we’re going to do, this bit right here is kind of like
the dictionary part. So the first thing that we’re going to do
is we’re going to select count from our database, that is an integer, where email equals. And this part right here
bears some explaining. This is going to be csev@umich.edu or whatever. Now, it is dangerous to put those strings, especially from user-entered
data, into your SQL. You technically could. I could make this be a email equals ‘csev@umich’ I would have to escape the quotes and stuff,
but this question mark is a placeholder. And this is a way to basically make sure
that we don’t allow SQL injection. Go Google SQL injection to get a sense of what that is, it’s more of an issue in online applications
but in this application, we’re just being good. So the way this works is, this is a placeholder in this SQL, that will ultimately be replaced by this. Now you could have several question marks. We only have one in here and so you give a tuple. And if we just put email,
it won’t turn into a tuple, this is a one-tuple basically, this little weird parentheses email comma. Parentheses, that is a tuple with only one thing in it and
that’s just the weird Python syntax. It’s rare that I apologize for
Python syntax but that’s a little bit less than pretty. But it’s okay, it’s a tuple. Normally, if there were like two of these, then there would be email name. So this cur.execute is actually not
really retrieving the data. In a way, it’s looking at the SQL and
making sure that maybe it might verify that the table name is right or if there’s
any syntax errors, etc. So this is actually not really reading the data,
but we have prepared this cursor, This is kind of like the opening of a file
but what we’re opening is a record set. We’re opening a set of records that are going to be this wherever it is true so it’s like
we’re going to read this like a file. Now, later things will loop through this
but we’re only going to say, hey grab that first one, We could have even put maybe a
limit clause on there or something. Grab the first one and give it back in row.
And so row is going to be the information that we get from the database. That is, if there are no records that meet this, then row is going to be null. So here’s kind of again like the get, where if the row wasn’t there
because the way we’re doing this, is we’re going to end up with
this row in the database. Here is this database and there’s going to be two columns and there’s a
bunch of rows and then here’s going to be see csev 4 and zhen 3 and steven 6. So these are accounts and so
we’re grabbing this variable out if it’s csev that we’re grabbing and
that’s going to come into here. That’s going to show up in here and that row is a list but we’re only getting one thing. And what we really are doing is if we search through and we got through and there was nothing,
then row was None means that there was none and we’re seeing like chen for the first time, we will have to insert it. So if row is None, we’re going to run an insert statement, insert into Counts email, count. Now, we’ve got to set it to 1 because
it’s the first time we’ve seen it. So values and then, again, the question mark, the question mark basically says, hey I’m going to have a value in this tuple
and there’s an orderin g to the tuple. So there’s only one question here. One question mark placeholder here and then 1 is the initial count so email, question mark, count 1 away we go and so then again we have a tuple that gives to this execute statement,
just like in that execute statement, the corresponding sort of strings or integers that are to be placed by
each of the questions. So when this runs, there’s going to be a
new record and there’s going to be a 1 that’s put in there into that new record. If, on the other hand, we pull back
a row that exists, we’re going to get this 4 number
and you might think we want to take this 4 number and add it
but in databases it’s always better to do an update because there might be multiple applications that are talking
to this database at the same time. So no matter what update does
is in a single atomic operation, it turns whatever this number is into
one higher and we don’t have to worry about other pieces of code
potentially modifying it. Now in this case we don’t have to
worry about that because we’re the only piece of code but using update to increment something is way better than reading the value and then doing an update to
adding one inside of Python and then updating the new value which is that two SQL statements but it’s also not atomic. So if the row exists, we just know that it exists and we
just want to add one to the number. We do have the number sitting here in the
row variable but we don’t need it. And so we’re going to say, update count, set count equals count plus 1 comma name, where email equals and then another placeholder and then another tuple
for the question mark. Okay? So that’s what this little bit of code does. That is kind of the the read it, parse it, check to see if it’s there, if it’s not, insert it, if it is, update it. And so then, we see this conn.commit. And this conn.commit, basically the
way it works is that the database is efficiently keeping some
of the information in memory and at some point, has to write all that stuff out to disk. So you can choose at times where you
put this commit and right now, we’re going to commit every time
through this loop but you might commit every 10th time through the loop
because the commit will take some time because it forces everything
to be written to disk and these can run really fast and
the commit is the slowest part here. So sometimes we do things like commit
every 10th record or every 100th record. If it’s an online system, which is not what this is, you have to commit at the end of
every sort of screen thing, but for this kind of a system because we’re putting so much in as
kind of a bulk insert, we might come up with a thing where every one
every 10th time we do a commit. But ultimately, what this will do,
when this is running, is it will build up slowly but surely
adding new records and then 1, 1, and then a 2 and a 3 and all these things and add
another 1, that will be 1. It’ll do this thing, right?
And at the end of the day, that is what’s going to be in the database. So let’s take a look at what’s in the database and now we
can actually read the database and so, in the database, we’re going to run a select
and we’re going to select the email account from counts order
by count descending. So look at that, isn’t that cool? We’re getting in the top ten because databases are good at sorting and they’re good at
all these other things. So we’re going to then execute this and then
we’re going to ask for the rows one at a time and the rows are going to be a tuple and row sub zero will be email
and row sub one will be count. So we run all this stuff and then we
close the connection and away we go. Okay? So let’s go ahead and run all this stuff
python3 emaildb.py. It asks for a filename mbox-short. I can hit enter right mbox-short. And that’s it, and it looks just like that
and it counts it and away we go. The difference is, at this point, we have a file emaildb.sqlite and we can run the sqlite browser and we can then open this database and we can see what’s in there.
So here we go. It has made an SQLite database. We have a table of counts and then we can
take a look at the data and there we go. We’ve got the data and we can do this. Let me close this. It’s important at times when you
don’t want necessarily to have, let’s see if we can cause it to lock up. Let me run this again and it’s going
to drop this table so I’m going to run the code again but this time, I am going to do the full one, mbox.txt. Now, we’ll see what happens here. But it ran so what we have to do then to see this data, this is from the previous run but
if we wanted the most recent one, we hit refresh and then away we go
and so we can see this stuff. And so this is just a real simple start to see how you can connect some of the stuff that we’ve been doing but store the data in a database, but the nice thing about the database is that it can store this stuff from run to run even though, in this case we’re dropping the table every time. In later things, we will see how we can
store data from run to run to give ourselves more restartable processes.
Cheers.

Assessments


Quiz: Single-Table SQL

Structured Query Language (SQL) is used to (check all that apply)

Which of these is the right syntax to make a new table?

Which SQL command is used to insert a new row into a table?

Which command is used to retrieve all records from a table?

Which keyword will cause the results of the query to be displayed in sorted order?

In database terminology, another word for table is

In a typical online production environment, who has direct access to the production database?

What happens if a DELETE command is run on a table without a WHERE clause?

Which of the following commands would update a column named “name” in a table named “Users”?

What does this SQL command do?
SELECT COUNT(*) FROM Users

Graded App Item: Counting Email in a Database

Reading

Code

Bonus


Video: Bonus: Office Hours Zagreb, Croatia

Student Profiles

  • Marco: Graduate student in electrical engineering and computing, using Coursera to expand knowledge and enhance his CV.
  • Anamaria: Student of philosophy and social humanistic informatics, using Coursera to explore interdisciplinary interests and increase job market value.

Why they use Coursera

  • Broadening knowledge: Both students seek to learn beyond the scope of their formal studies.
  • Career development: They see Coursera certificates as a way to stand out to potential employers.

Concerns about Coursera

  • Future cost: Anamaria fears that monetization may limit Coursera’s open accessibility.

Suggestions for improvement

  • Faster: Marco wants a faster user experience (this might refer to course pacing or platform responsiveness).
  • More Engaging Anamaria wants the learning experience to be fun, like the courses led by Chuck and Don.

Hi, Chuck here. I’m in Zagreb, Croatia. And here we are at another
instance of an Office Hours. The extended edition. Tell us your name. >> Hi, I’m Marco. >> Tell me what are you doing
in education right now? What degree are you pursuing? >> I’m studying faculty of
electrical engineering and computing at University of Zagreb,
and I am graduate student. >> And so
how many Coursera classes have you taken? >> I’ve taken two courses,
yours and Introduction to Python. >> Why do you use Coursera? Why do you add in Coursera
to you education? >> Because first of all I
want to broaden my knowledge. >> And after all, some, how do you say? To show something in my CV that I have taken other courses
than studying at a faculty. >> Do you have any suggestions for
improvements? Or what would you say if you were
talking to the people who ran Coursera? >> Just make it faster. >> Make it faster? >> Yeah. >> Okay. So, tell me your name. >> My name is Anamaria Cudanar. I’m student of philosophy and social humanistic informatics,
graduate and undergraduate. What was the question? [LAUGH]

So tell me about how many Coursera
courses have you taken? >> I’ve taken four,
Guide to Irrational Behavior, How to reason and argue, and
human computer interaction. And the fourth one was artificial
intelligence at Stanford. >> Did you finish them all? >> No, I finished only one. And I got a certificate from
how to reason and argue. So how do you fit it in with the rest of
your education, I mean how does it fit, and what part of your education
does Coursera fulfill? >> The things that I learned
are outside of my study scope, I’m really interesting in a lot of
things and I think that people should be more prone to interdisciplinary,
So I want to learn things from other places, and I think that makes
me more valuable at the market, and I’m really genuinely
interested in a lot of things. >> So is there anything that worries you
about Coursera, does it bother you, or do you have some fears about the future? Or do you think it’s like
Fears? I fear that it will become,
when money gets involved, it will no longer probably able to be
open-source because somebody has to pay it, and
I think it’s only a matter of time when it will get,
somebody will have to pay for it. So, I’m afraid that it will no longer
be available to a lot of people. >> Okay.
Do you have anything you would say, to suggest to improve it? That you’re a professional and
sort of learning technology. Do you have any suggestions? >> Just make it fun. Like you or Don already does. >> Okay. >> Okay, so. >> So there you have it, the voice
of the student from downtown Zagreb, and we’ll talk to you next time.

Video: Interview: Elizabeth Fong – The Early Years of SQL

The Need for Database Standardization

  • Software Diversity: The speaker highlights the wide range of database management systems (DBMS) that emerged, each with its own strengths and weaknesses.
  • Application Portability: Users wanted applications (like student course records) to work across different DBMS platforms (Oracle, Microsoft SQL, etc.). This required a common language.

Key Historical Developments

  • Early File Systems: Hierarchical structures were common (like in IBM’s IMS ), but debates persisted about the best model (tree, network, flat file).
  • Metadata and Schemas: The importance of self-describing data (now called metadata or schema) was recognized early on.
  • The Database System Study Group: This group laid the groundwork for defining what a DBMS should do (store, retrieve, modify, etc. data).
  • ANSI and X3H2: Formation of standards committees to develop a common database management language.
  • Birth of SQL: Chris Date and IBM championed the relational model with its table-based structure and the simple yet powerful SQL (Structured Query Language).

Standardization Principles

  • Focus on Interfaces: Standardization doesn’t restrict capabilities, but rather ensures common communication protocols.
  • SQL as the Standard: SQL became the agreed-upon way to work with relational databases.
  • Conformance Testing: Certification labs, like NVLAP, test DBMS implementations to ensure they follow the SQL standard.

Market Impact

  • User Power: Standardization gave users the power to demand SQL compliance, driving vendors to adopt it.
  • Timing is Key: Standardize too early, you stifle innovation; too late, you miss the chance to unify the market. SQL is seen as a successful example of good timing.

[MUSIC] And they were part of the coming,
going from wild. And then we realized that,
hey, I want to buy a product. What kind of a feature do I need? Another one of the things that I
have done is the feature analysis. So we have about half a dozen of product. And the standardization
are really building a consensus. We brought from a matured technology that are ready to be
standardized where people say, do I buy IBM or do I buy Oracle,
or shall I buy a cheaper one? This kind of a decision started emerging. And selection of, I mean, how many variety of application you can build on
top of a software foundation, what I call a database management system,
is too varied. And so you want to have
some sort of a standard so that your application can
work on different platform. We work with files, and
we call it the file system. The files are hierarchical,
they were in the IMS, the IBM information management system,
which is a tree structure. And the debate was going on
whether there should be a tree or a network or a flat file. And we are still debating
whether the data have a self-describing tag, and later on,
we known it as a metadata. And now, people call it the schema. So the Database System Study Group
come up with a reference model or a specification for a minimal functionality
of a database management system. In order to be a database management
system, you want to be able to store data, retrieve data, modify data, organize data, delete, manipulate data,
and it becomes a spec. And during that time
there was a birth of a, we initiated a birth of a ANSI group. It’s now called INCITS, the ANSI
American national standards group. And it’s called the X3H2
of which Don Deutsch and people like Len Gallagher
all participated in that. That group is called the data
management language. In order to standardize anything, you realize that you can have a lot
of light bulbs, for example. You can have red light bulb,
white light bulb. The only thing you want to standardize is
when you want to talk to another person. Communication, interface, or a area where both of us have to understand
a common vocabulary, or whatever. So the standard, the only standardization of a software system is not
the capability but a language. And relational database
at that time was IBM, Chris Date, and
he’s talking about normalization. He started talking about flat file and he called it a table, and it’s a very
easy concept that everybody understand. So in order to retrieve a table, you say select from a column such and
such, from a table of employee, and there was a birth of
a simple query language. Testing part is also very important
aspect of when you adopt a standard. You want to certify that
the product conform to such and such version of ISO standard,
or JTC 1, whatever it is. So that gets to be if your,
otherwise your app won’t work. Your application, let’s say you
build a student course record. And no matter, you got Oracle, you got a Xybase or Microsoft SQL, you want your
application to work no matter what. And that’s what the marketplace
wanted to go, the user. Of course Oracle, or Microsoft, buy my
product, Oracle would say buy my product. And in the procurement, you say, I want to be compliant with SQL. And so you have to have
the conformance testing certificate. And we have the NVLAP,
it’s a laboratory that’s certified and gives you a validated product list. Here is the list of product that have been
validated that they conformed to it. And you can buy from that list,
but there is a requirement. This is strictly user. You’re buying, because
you’re paying the money, not us. Timing is everything. You can’t standardize a thing too early, or you drive a lot of
innovative concept away. Because people say,
hey, there’s no way for me to get into your market,
because you already decided. And even though it’s not very good,
I won’t get into that business. So, you killed innovation. And if it is too late,
you missed the opportunity, you get too many variety of things
coming up, and the choice is too much. But, of course, SQL is one of
the success stories that we have. [MUSIC]