Home » University of Michigan » Python for Everybody Specialization » Using Databases with Python » Week 5: Databases and Visualization

Week 5: Databases and Visualization

In this section, we put it all together: we retrieve and process some data, and then use the OpenStreetMap API to visualize it.


Lectures


Video: 16.1 – Geocoding

Key Concepts

  • Multi-Step Data Analysis: Often involves separate processes for data gathering, cleaning/processing, and then analysis/visualization. This makes large or complex tasks easier to manage.
  • Restartable Processes: Important for gathering data from sources with rate limits or where network issues may occur. It allows you to resume progress without losing work.
  • Geocoding: Translating addresses into precise map coordinates (latitude/longitude).

The Geocoding Workflow

  1. Gather Raw Data (geoload.py):
    • Retrieves geocoding information for locations listed in ‘where.data’
    • Stores data efficiently in a local database (‘geodata.sqlite’) for restartability and to avoid repeating lookups.
  2. Clean and Prepare Data (geodump.py):
    • Parses the raw data from the database.
    • Prints a summary of retrieved locations.
    • Creates a JavaScript file (‘where.js’) with the necessary map data.
  3. Visualize (where.html):
    • Loads the JavaScript data from ‘where.js’.
    • Renders a map (using OpenStreetMap) with pins marking the geocoded locations.
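The restartable-gathering idea behind step 1 can be sketched roughly as below. The table name, columns, and the placeholder fetch are illustrative assumptions, not the exact code from geoload.py:

```python
import sqlite3

# Open (or create) the local cache database.
conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)')

def already_have(address):
    # The restartability check: skip anything already in the database.
    cur.execute('SELECT geodata FROM Locations WHERE address = ?', (address,))
    return cur.fetchone() is not None

for address in ['Ann Arbor, MI', 'Chicago, IL']:
    if already_have(address):
        continue  # resuming after an interruption costs nothing
    geodata = '{"features": []}'  # placeholder for the real API call
    cur.execute('INSERT INTO Locations (address, geodata) VALUES (?, ?)',
                (address, geodata))
    conn.commit()  # commit each row so an interruption loses no work
```

Because each row is committed as soon as it is fetched, the loop can be killed and restarted at any point and will only fetch addresses it does not yet have.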

How to Make Your Own Map

  1. Modify ‘where.data’: Add your desired location to be geocoded.
  2. Run ‘geoload.py’: This will fetch the geocoding data and update the database.
  3. Run ‘geodump.py’: This ensures your data is correctly parsed and the ‘where.js’ file is updated.
  4. Open ‘where.html’: Your browser will load the map with your new location marked.

Key Takeaways

  • This example emphasizes the value of combining basic web data retrieval and parsing skills for a practical application.
  • While not a professional-grade tool, it demonstrates core data mining concepts and offers a fun, customizable way to visualize data on maps.

We’re now going to start
putting this all together. We’ve been learning
for a long time. We’re going to
actually now retrieve data from various sources and we’re going to visualize it. That’s what we’re
going to do next. I’m not trying to turn you into the world’s greatest
data visualizer because there’s
lots of different technologies that
you’re going to use. I’m just trying to give you some fun examples to show
you what can be done. The next few things
that we’re going to do are all examples of what I’ll call a multi
step data analysis. The key thing is, is
there is like a database. We do things up to this point where you
just run some Python, it retrieves some data, does some parsing or
whatever it does, and it throws it away, and
then the program ends. The problem is, is what
if you have to retrieve a million things from
a rate limit API? Or what if your quarter of the way through
a million things and your network goes down? Or maybe your laptop
goes to sleep? You need a restartable process. In this, we find that
it’s a good idea to separate into two completely
different processes. The sort of raw gathering, and that is you’re going
to hit this data source, you’re going to loop through it, and instead of
trying to clean up your data and figure out
or parse it or whatever, you just start this really
simple gathering process that writes to a
local database on your computer or in
the Cloud somewhere, and it says, okay, which
ones have I retrieved? Which ones haven’t I retrieved? Like if this is a
spidering process, you might have a
bunch of web pages and you retrieve some web pages, and then you look at
them and find more links and then you retrieve
those pages. The idea is that you have this restartable
gathering process that doesn’t do a
bunch of cleaning, etc, it just gets the raw data from the
Internet into your computer. Then what happens is you
build a cleaning or parsing, or who knows what
it is that you’re going to do with
each of the things, whether it’s a JSON
data or a web page? Everything that you’ve retrieved may need some cleanup
or some checking, or you might even make
a thing where you say, look, this was bad and I want to redo this and then restart
the gather process, and it finds all
the parts that it needs to regather
and away you go. There’s this cleaning phase. Now, this gather process when you have a lot of data
that might take a day or two. But the cleaning and the processing phase
that goes from one database to the
other database, that might run in a few minutes, which is why you take
the time because the gathering process
is the process that talks to the Internet that might get in trouble
with rate limiting. But the data that’s
already on your computer, you run a Python program, goes, runs through it
really, really fast. Then you get the clean data. Then in yet another process, you might do some analysis
of this clean data. You might write some Python
and do some dictionaries, some histograms,
or you might even write some visualization code. In the example I’ve
got, you’re taking the cleaned up data and you’re visualizing it and
making a map out of it. You’re even going to do this. Now, like I said, this is not professional
data mining. But I will say that you
can impress your friends. Show that you did a page rank
algorithm or show that you did a geodata visualization. I’m handing this all
to you in ZIP files. Mostly you just have to run the ZIP files and do
impressive things, and you can do some
cool visualizations. If you’re going to get serious
about it, like at a job, there are all kinds of really scalable data
mining technologies that you use rather
than writing your own. Some of this stuff about like gathering and getting the
raw data in a database, there are tools that do that. This is just scratching
the surface, but we’re going to
do it all ourselves, which is really cool, so
that you know how to do it. Then when you go use
a more sophisticated, more professional data
mining technology, then you know what
it’s doing at least. You’re like, if
it works or blows up or has a problem,
you’re like, oh, I know how to do what this fancy tool is doing better than my code
would have done, but you know how to do it. Like I said, this really is
an exercise in software, not an exercise in
the best practices of data mining or
data mining experts. But like I said, it’s
like the key that opens for you the
concept of data mining. The first thing
we’re going to go through is geocoding data. I love geocoding data. Like I said, you can put any address in and it eventually
it’s a pin on the map. That’s kind of the assignment
that I’m going to give you. Now, be careful when you’re
doing this assignment that you really put
your pin on the map. These looks like it. You’re
supposed to find a place, you’ve put your own
pin on the map. Not just take my pin, not
just take a screenshot, you’re supposed to
actually run it. What’s going to do is
we’re going to use the proxy GeoAPI data so that you don’t get in
trouble with rate limits, you’re going to copy it into your local database
that can be restarted, and then you can visualize
it in the browser. One of the nice things about the two phase process is you can pull in a couple
hundred things and then you can play
with the visualization. You don’t have to wait till
you pull all the things in. You’ll use Open Street
Map to visualize it. This is the code that
you’re going to run. Again, we’re taking that
OpenStreetMap data. We've got an application called geoload.py. geoload.py takes a list of locations, and this is where you're going to modify it. It takes a list of locations to go and look up the geocoding. Your assignment will be to add a single item to where.data, then run geoload.py, which will go get all the
ones that are already in where.data plus yours. That’s going to build a
database called geodata.sqlite. Again, that first process that geoload.py is a
restartable process. You can hit control C or control Z depending on
your operating system, and it can be run
and it can blow up. Then you’ll start it back
up and you’ll notice it’s smart enough to not do
the same thing twice. Because it looks in
geodata.sqlite and says, oh, I already got that one. It’s an efficient, restartable,
spidering like process. In this case, it’s spidering an API versus
spidering web pages. At any moment in
time, you can stop geoload and then run geodump. What geodump does is the parsing of the data that's in SQLite, and it prints it out so you can see
what you’ve got in there. It prints out the
locations that you looked up, including the address. This is a list of
universities basically, and the address of
that university, I think this might
be the data from 2013 when I first started making this
class a long time ago. This is the 2013 data
from a survey that we did about where
students were coming from. Geodump reads through
geodata.sqlite and it dumps out the data and it’s also parsing the data
because if you recall, the OpenStreetMap data is a pretty big thing, but what you're seeing there is a downsampling of that
data on the printout, and it’s also writing
a file where.js. The where.js is a
JavaScript array basically that has all
these positions, etc. Where.js gets rewritten every time you run the application. Then if you just open where.html, and I give you all these things in opengeo.zip: geoload, geodump, where.js, where.html. Then when you open up where.html, it reads where.js and puts a
bunch of pins in the thing, including the pin
that you added for the location that you
added to where.data. There you are. You could use
this to visualize anything. You could just come up
with your where.data, which is a bunch of geolocations that you want geocoded.
There you go. This is a lot of
fun. It’s the payoff for all the hard work that you’ve done or you’ve got a real pretty picture when
it’s all said and done. To summarize, we have explored lots of ways of
accessing data from the web. This is our first
real application that we’re bringing
it all together, but it uses the same set of skills: retrieving things with urllib, and parsing JSON, HTML, etc. We've learned all that stuff. This is only the beginning
and you can take this in a lot of different directions
and it’s a lot of fun.

Assessment


Video: Worked Example: Geodata (Chapter 16)

Problem: Efficiently Geocoding Large Numbers of Locations

  • Geocoding APIs can be slow, rate-limited, and expensive.
  • We need a solution to handle many locations without constantly hitting API limits.

Solution: Restartable Geocoding with a Database

  1. geoload.py:
    • Reads location list: Processes locations from “where.data”
    • Talks to API: Fetches geocoding data (latitude, longitude) using a proxy to avoid hitting real API limits directly.
    • Stores in Database: Inserts retrieved location and its JSON data into “geodata.sqlite” for persistence.
    • Restartable: Checks the database before hitting the API, avoiding repeated work and allowing you to resume after interruptions.
  2. geodump.py
    • Reads Database: Extracts data from “geodata.sqlite”
    • Creates where.js: Generates a JavaScript file containing the location data in a format ready for mapping.
    • Writes debug output
  3. where.html and where.js
    • Visualization: where.html loads where.js and uses OpenStreetMap to display the locations as pins on a map.
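The extraction step in geodump.py can be sketched as follows. The JSON here is a hand-made stand-in whose shape (a GeoJSON-style `features` list) mirrors what the proxy returns, though the exact fields are illustrative:

```python
import json

# Stand-in for the 'geodata' column of one row of geodata.sqlite.
raw = '''{"features": [{"type": "Feature",
  "properties": {"display_name": "Ann Arbor, Michigan, USA"},
  "geometry": {"type": "Point", "coordinates": [-83.74, 42.28]}}]}'''

js = json.loads(raw)
if len(js['features']) > 0:         # sanity check: skip empty results
    feature = js['features'][0]     # features is an array; take item zero
    # GeoJSON coordinates are ordered [longitude, latitude]
    lon, lat = feature['geometry']['coordinates']
    name = feature['properties']['display_name']
    print(name, lat, lon)
```

The real geodump.py does this for every row, printing each location and collecting the values for where.js.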

Key Takeaways

  • This setup allows you to fetch geocoding data incrementally without overwhelming the API.
  • The SQLite database makes the process restartable, so you can pick up where you left off.
  • You can add your own location to “where.data” and see it visualized on the map.
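Generating where.js is mostly a matter of wrapping the extracted rows in a JavaScript assignment. This sketch assumes a variable name (`myData`) and row layout for illustration; check where.html for what it actually expects:

```python
import json

# (name, latitude, longitude) tuples, as they might come out of geodata.sqlite
locations = [('Ann Arbor, Michigan, USA', 42.28, -83.74)]

rows = [[lat, lon, name] for (name, lat, lon) in locations]
with open('where.js', 'w') as fh:
    # where.js is just a JSON array assigned to a JavaScript variable,
    # so json.dumps does most of the work.
    fh.write('myData = ' + json.dumps(rows) + ';\n')
```

Because JSON is (almost) a subset of JavaScript, the browser can load this file with a plain `<script>` tag and read `myData` directly.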


Hello and welcome to Python for Everybody and another code walk-through. In this one we're going to walk through a
Chapter 16 diagram. Let’s take a look at it. This is a situation where
we're going to be talking to some OpenStreetMap data through an API, through a proxy to that API, and we're going to write a bit of code that's going to
repeatedly pull this stuff down. Now, these APIs are
notoriously slow, expensive, difficult, and sometimes
even unreliable. If you hit a rate limit, it
starts blowing up on you so we want this to be
a restartable process. What we’re going to
do is write this code called geoload.py
and it is going to read a file that is the list of the locations we
ultimately want to have. When you’re doing your homework, you got to edit this file. But the idea is that this
geo load is going to read all the locations
we want and then starting to pull them down
and then it's going to simply insert the JSON into this geodata.sqlite file. Then this part here, geoload, is a restartable process. When we do it, we'll start
it a couple of times. It’s smart enough to know
the things it already has and not start from the
beginning over and over. That’s why it’s a
restartable process. At any given point,
we can in a sense, view and visualize the
current contents of this database with geodump.py. That reads the database and produces a little file called where.js; it also writes a little debug output. Then that where.js is read in when you open a browser. This file where.html
opens in a browser and then it reads the
where.js and then it visualizes it and
that’s really cool. At some point, you’re
supposed to put your own location in where.data and then show in the
visualization that you were able to
get your location. Without further ado let’s
go take a look at the code. The code you’re going to
need to use is opengeo.zip. You download www.py4e.com/code/opengeo.zip and then you unzip that into a folder, and you will have a folder called opengeo. If I'm looking at my current folder of opengeo, you can see that it's got geodump.py, geoload.py, where.data, where.html, and where.js. The where.html and
where.js those are the visualization part
of it, geoload is the spidering part of it, and geodump is the read-the-database part. Let's go ahead and start
taking a look at geoload.py. This is the restartable
spider and if you look, it’s going to be looking
at a proxy of the GeoAPI. There's OpenStreetMap, and we can search for things in OpenStreetMap and do this geocoding by hand; every time we do a map search, a geocoding is happening. We're talking to GeoAPI, which in effect takes the free data but then
wraps an API around it. This is the API we’re
going to get but if you’re going to use it, you got to get an account.
So you got to sign up. What I’ve done to
simplify things in this class is I’ve
made a proxy for it. This proxy does
not require a key, but it has really
fierce API rate limits, meaning that it slows down terribly if you use it more than you should for this class. But that slowing
down also is fun because you are doing
a restartable thing. You don’t need a key
to talk to this, it’s just for this class. The protocol is pretty simple. You have a query
parameter which is the text location
properly URL-encoded: spaces are encoded as pluses and commas as %2C. That's the kind of
thing we've got to do. If we take a quick look back at our code, we looked at this code in a previous chapter; there we only retrieved one, and now we're building a restartable process using a database. We have this prefix of it. Now, this is database code, so we're going to create or open an existing database, and in this case we don't yet have a database; ls -l shows there is no database. You'll see the first time we
start this we’re going to create this database and we’re going to create
a simple table. Now the key thing is that all we’re going to
do is we’re going to take the address
that we looked up as the key and the JSON that
we get back as a value. In this case Ann Arbor, Michigan is the key and this raw data which looks
prettier, that is the value. That’s what we’re going
to do because again, this is the restartable process. You want this
restartable process to be simple and restartable. Then if we continue
down in the code, we see the certificate problems having to do with the
fact that there’s not a lot of good
certificates that are default included in Python and so it’s
simpler for now. In production, you can figure this out a little bit better, but that just makes sure that our urllib stuff works. There you go. It's unfortunate
because when they fixed this thing, they didn't like the certificates that had some problems, so they just took them all out. We went from a lot of good certificates
and one bad one to no good certificates and then everyone immediately went and ignored the certificates. The law of unintended
consequences. Let’s take a look at
this file, where.data. Where.data, the idea
at some point is you want to know the locations. This is just the input to our program, and this is just all the places, and we can add to this: Jerk Pit, Ann Arbor, Michigan. This is the to-do
list, in a sense. That’s what we’re going
to do and I give you a starting point and you’re
supposed to in your work, add something to it. We’re going to read this
where.data, and that file had, I don't remember, 250 or 275 lines, but you might have thousands of lines. Then we're going to read the whole file, and if we get past 100 retrieved locations, we're going to stop, and that's to let our API settle down. You can change these numbers. Then we're going to take
the address from the line, which is just a text line, and then we’re going
to check to see if the address has
already been retrieved. We're doing an encode and a memoryview because of UTF-8, and because the database is outside of Python. There you go. Then what we're going
to do is we’re going to retrieve that record. If the address was
already in the database, at the beginning it’s not
because the database is empty, we're just going to continue and go zooming through this file, not retrieving the ones that we've already retrieved. But if an address isn't there, then we're going to create a URL. This is basically the pattern: we create a dictionary, and we say q equals Ann Arbor, Michigan, but that Ann Arbor, Michigan has to be properly URL-encoded. That's what this
urllib.parse.urlencode does. That ends up with a URL
that looks like this. It all gets
concatenated together, but that encoding is something we don’t have
to write the code for. Thank heaven, because
even though it’s easy to explain in
a simple example, it’s hard to do it
right in all cases, so we’ll just let the
libraries do that. Now we have a URL,
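As a quick aside, here is a minimal sketch of what that encoding step does; the service URL is a stand-in, not the real proxy address:

```python
from urllib.parse import urlencode

serviceurl = 'http://example.com/api?'  # stand-in for the proxy URL
# urlencode handles all the escaping: spaces become '+', the comma '%2C'.
url = serviceurl + urlencode({'q': 'Ann Arbor, MI'})
print(url)  # http://example.com/api?q=Ann+Arbor%2C+MI
```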
so we open that URL. This little context = ctx is the trick we use to
ignore the SSL errors. Then we read it. But the stuff that
comes back, of course, this here, if we were to look
at the headers, let’s see. We could look at the headers
and you’d see that this is a UTF-8 application/JSON. Python internally
does not use UTF-8. Python internally uses Unicode, so decode says that
stuff we just read, UTF-8, decode it into
our internal format, which is Unicode,
and there you go. Then I put out a debug statement
and I increment my count so that each run does no more than 100. Then I try to parse
it using json.loads. Then I have a couple
of sanity checks to make sure that I got real data. If I didn’t even get data, I should just complain about it and either break or continue. Then I am going to insert
it into the database. Insert into locations,
address, geodata, encode the address, and encode the data,
and that’s it. I’m going to commit
it, and I'm going to sleep five seconds each time, and that goes round and
round and retrieves them. Then at the end of the loop, it prints the number of features and sends
a little message. Let’s just go ahead and run it. Some of this code we
talked about is just adapted from a couple
of chapters ago. Let's type python geoload.py. Now it's going to start reading where.data and hitting the API. You see, there you go. Now I'm going to
hit “Control” “C”. It did the first 10
and then it hit this, and I just hit “Control”
“C” to blow it up. If we look in our database
now, let me start that. If we look at our database
and we open a database. Where am I? I’ve got
to find where I’m at. You will have to find
where you’re at. I just have it on
my folder names. I’m working right
in my main thing. There we are. I found that
file so that file now exists. If I take a look at the data, I see I have one table, it’s the locations table, and I see the address, and then I see the data itself. There we go. That’s all I did. I just went through
the first 10, and so they’re already there. Now let’s go run this thing
again and just start it up. Now what it’s going to
do is it’s going to read the whole file
of where.data. But you’ll notice that it has
the first 10 already done. It just says got it. It’s very rapidly
through the first 10. Now it’s grabbed the next 10, and now it’s grabbing
the next ten. Again, let’s just
say your network goes down, the whole
thing blows up. Well, let’s take a
look at the data. We’ve got 30 now. This might be 3,000
instead of 30, but you come back in the
morning like, oh, I blew up. What do you do? You
start it up again. Now it goes, oh, I got all
those. Here’s the next 10. Here’s the next 10.
Let’s just blow this up and let’s briefly comment this out so
it gets done faster. Now it’s not going to sleep. It’ll say pausing for a
bit, but it's going to go. Now it's retrieving; it has got to go through 257. Again, this is good data, it's going through
257 of these things. I didn’t make it do
25,000 or a million, but this pattern could work. It went to 100 so
I got to run it again because it had
another limit, only 100. We’ll get there.
Just keep loading. I want to get to my thing. Somewhere in there, there was an error, and you could have gone and looked at what that error was. I still have to get it again. Got to run it. I should have just
changed the line. Actually, you can see it’s slow. It hit the rate limit. Let
me put that sleep back in. Interestingly, now if we run it, we’ve been through,
well, let’s go look at our database
for a second. Let’s reload the database
and we see that, there we go and the jerk
pit is already in there. That’s the one I added. There’s 273 in here. Now, watch what happens
when I run this again. It’s already done. Let’s take a look at the thing because what it was doing
there is it was reading all of the things in where.data and it was
checking geodata.sqlite. They're already
there so there was no need to go to the API. That’s what I mean by
a restartable process. You don’t need to
go back to the API and so I can run this
as many times as I want and it’s instant basically because there’s only 250
things read from a database, which happens really fast. In a sense, the way you
can think about this is this first phase is done. We restarted it a couple
times, we did this, we did that and now
we want to work on this second phase
where we run geodump and see all that it's going to do. Let's take a look at the code. These two look very different. Geoload worries about the fact that the API
may blow up on you, or it may be slow, or
you may run out of time. This tool, geodump.py it opens a database and then it just
reads all this stuff and it starts writing this file where.js file and a
funky weird thing, and it basically
reads the first row, just reads the rows, and then it parses the JSON and then it checks to see if that
features is there, which is the outer thing. Let’s go look here. Raw data, pretty print features
is this outer thing. It’s an array of features but we’re just
going to look at the zeroth item, and then we're going to look inside properties. Or let's look at it this way. We're going to go into features, and that's an array, so we're going to go into sub-zero, and then we're going to go into properties, and inside properties we're going to grab the longitude, the latitude, and
whatever else it is. You can see these little things. We’re going into the
JSON: sub-features, sub-zero (because features is an array), sub-geometry, sub-coordinates; the zeroth and the first are the longitude and latitude. Let's go ahead and run that. We're going to print it and
we’re actually writing it into this where.js
file at the same time. This is also going
to be really fast. That was super fast because it was only reading
from the database and writing to where.js and then there’s something wrong
with this particular one. I’m not going to worry
too much about that, but if I look at where.js it’s just a Javascript variable that is all of this data
with longitude, latitude, and the name and that’s how you
put pins on a map. If you look at where.html, well, this is not an HTML class. I'm just giving you this. It does a whole bunch of HTML to use OpenGeo to
visualize this stuff. Now all I’m going to do
is I’m going to say Open where.html and there we go. This is the map.
These are all of the retrieved locations and if I click on one of these
things, you see this. I can find my location
somewhere on here. I have one pin at the University of Michigan; this is where it decided the center of the University of Michigan is. I'm not sure I agree with
that but hey, that’s okay. I guess it’s open street map, but the second one was
the one that I put on and that is the
Jamaican Jerk Pit, which is one of my favorite restaurants in the whole world. There you go, the
Jamaican Jerk Pit. My office is currently in
this building right here, and I walk less
than half a block to the Jamaican Jerk Pit. That's basically a walk-through of our restartable geocoding process and then visualization using OpenStreetMap. I hope you found
that useful. Cheers.

Bonus