Week 5: Web Services and XML (Chapter 13)

In this section, we learn how to retrieve and parse XML (eXtensible Markup Language) data.

Table Of Contents

Lectures
Assignments
Bonus Material

Lectures

Video: 13.1 – Data on the Web

Notes

Transcript

Summary of Chapter 13: Web Services

Main ideas:

Web services add structure and formality to data exchange over the internet.
They shift the focus from human-readable documents to data consumption by programs.
Two key concepts:
- Wire protocol: An agreed-upon format for data transmission across networks, independent of programming languages.
- Serialization/deserialization: Converting data structures within programs to/from the wire format.

Key points:

Network data format needs to be independent of specific programming languages (Python, Java, etc.).
XML and JSON are popular wire formats, with XML being older and more complex, and JSON being lighter and more modern.
Designing an optimal wire format involves balancing language neutrality, efficiency, and ease of use.
Web services focus on data-oriented documents, optimized for program consumption rather than human readability.
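
The serialize, send, deserialize round trip described above might be sketched as follows with Python's standard xml.etree.ElementTree module. The helper names serialize_person and deserialize_person are illustrative assumptions, not course code, and the layout (one child element per dictionary key) is only one possible agreed-upon wire format.

```python
import xml.etree.ElementTree as ET

def serialize_person(person):
    """Turn a Python dictionary into XML text (the wire format)."""
    root = ET.Element("person")
    for key, value in person.items():
        child = ET.SubElement(root, key)   # assumes each key is a valid XML tag name
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

def deserialize_person(xml_text):
    """Turn XML text received from the wire back into a Python dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

person = {"name": "Chuck", "phone": "+1 734 303 4456"}
wire_text = serialize_person(person)       # plain text, not Python-specific memory
restored = deserialize_person(wire_text)

print(wire_text)
print(restored == person)
```

A receiver written in another language would parse the same text into its own structure, for example a Java HashMap.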

Next:

A deeper dive into XML as the first example of a wire format.

So welcome to Chapter 13. We’re going to talk about web services. And we’ve been talking about moving data
using the request/response cycle and HTTP and urllib. Web services is really adding
a layer of formalism on top of that. Where we’re just being a little more
formal about how we do this and basically, at some point we’ll just
switch from it’s moving data back and forth to these are APIs,
application program interfaces. And so, like we’ve said before,
this request/response cycle that was originally for documents and
images, has been used for data. And we have been coming up with ways
to move data in a way that really have nothing to do with humans viewing them,
but instead have programs producing and consuming this data. And the basic idea is
you have data in a program, and so we’ve got two programs here, and they’re
going to communicate across the Internet. So we might have a Python program
that’s producing the data. Maybe it’s reading a database,
maybe it’s reading a file. Who knows what it is. But inside it has a Python data structure,
like a dictionary. And we want to send that
across the network. Okay, and so the network is not Python. The network is not Java. The network is a data, I mean it’s
data that’s going to go across. And so we have worked out, over the years, what we call the Wire Protocol, or
how the data is put on the wire, or how the data leaves one system, transits
a network, and then enters another system. And in that destination system,
it’s not always Python. It could be another program. And so, perhaps, our Python dictionary in this other
system needs to be a Java HashMap. And so, we can’t say that we’re going to
send Python data across the network, and we can’t say that we’re going to
send Java data across the network. We just have to send data in
some format that we agree on. And so we have to argue about
what the format is and say, okay we’re going to do this, and this XML,
which is one of the wire formats. And it’s, okay we’re going to take this
data that’s in a Python dictionary and we’re going to, XML looks kind of
like HTML and it has less thans and greater thans as tags. And we’re going to send
a person across the network, person that’s going to have a name and
a phone number. That’s the data we’re
going to send across. And we’re going to say,
that’s our wire format. It’s not how Python thinks about it,
it’s not how Java thinks about it. It is an agreed on intermediate
protocol that is just text, right? It’s not internal memory. And the act of going from an internal
representation on one computer out to a sort of interchange format
is called serialization. And that has to do with the fact that,
in the old days we had these wires, and we sent the data serially,
across one character at a time. So it was taking, from the
internal memory of the computer a format that we could sort of send one
character at a time, character, character, character, character, character, so
we called this a serialization format. And so, then the act of taking the data
off of the wire and turning it into a new internal data
structure, in the new environment, potentially in a very new language,
is called de-serialization. So we take our internal structure,
serialize it, send it across the network, then we receive it. We de-serialize it, and then we use it
in this other programming language, in whatever structure makes sense,
in that particular programming language. The two types of
serialization formats that we’re going to talk about are
XML and JSON. And so those are the two. And XML’s kind of like
the older of the two, and JSON is the more modern of the two. XML is the more complex and, some would say,
more rigorous of the two, and JSON is the lighter-weight version of it. So you take your Python dictionary,
you produce JSON. You send JSON across the network
as a string or a document, and then you receive the document, and then you turn it into whatever it is
it’s going to be on that far side. So that’s the basic idea of
agreeing on data formats. And so you argue here in the middle, we
can argue, the Python people can come to the argument, the Java people, and the
JavaScript people, they can all come and argue about what the best wire format is. And that kind of engineering of
an interchange that is not particularly suited to any language better than any
other language, is part of the argument of building these data oriented documents
versus sort of human readable oriented documents. So, the first thing we’re
going to talk about is XML. And, as the first of these two
formats we’re going to talk about.
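
The "take your Python dictionary, you produce JSON" step from the transcript can be sketched with Python's standard json module. The dictionary contents are sample data chosen for illustration.

```python
import json

person = {"name": "Chuck", "phone": "+1 734 303 4456"}

# Serialize: internal Python dictionary -> text that can travel over the network.
wire_text = json.dumps(person)

# Deserialize: text received from the network -> a new internal structure.
restored = json.loads(wire_text)

print(wire_text)
print(type(wire_text).__name__, type(restored).__name__)
```

The text in wire_text is what actually crosses the network; the receiving side is free to rebuild it as whatever structure suits its own language.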

Video: 13.2 eXtensible Markup Language (XML)

Notes

Tutorial

Code

Transcript

XML: Simplified Guide for Sharing Structured Data

What is XML?

Extensible Markup Language (XML) is a format for structuring and sharing data.
Think of it as a flexible container with clear boundaries and labels.
Similar to HTML, it uses tags (start and end) to wrap and identify data chunks.

Key elements:

Tags: Define data sections (e.g., <person>).
Attributes: Specify additional information within tags (e.g., <person age="30">).
Text: Content within tags (e.g., “John Doe”).
Nodes: Elements and text are nodes forming a tree-like structure.

Comparison with HTML:

Flexibility: XML tags are custom-defined, unlike HTML with predefined tags like <h1>.
Purpose: XML focuses on data exchange, while HTML structures web pages for display.

Benefits:

Standardized format: Ensures accurate data exchange between different applications.
Structured data: Clearly organizes information for easy interpretation and processing.
Flexibility: Adapts to various data types and structures.

Understanding XML structure:

Tags: <node> and </node> enclose data.
Attributes: Key-value pairs like name="value" within start tags.
Text nodes: Contain actual data between tags.
Parent-child relationships: Nodes can have child nodes within them, forming a tree hierarchy.
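
The parent-child hierarchy can be made visible with a short recursive walk. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the walk function and the rows list are illustrative names, not part of the course code.

```python
import xml.etree.ElementTree as ET

data = """<person>
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes"/>
</person>"""

def walk(element, depth=0, rows=None):
    """Collect (depth, tag, attributes, text) for every element in the tree."""
    if rows is None:
        rows = []
    text = (element.text or "").strip()    # drop indentation-only text
    rows.append((depth, element.tag, dict(element.attrib), text))
    for child in element:                  # iterating an element yields its direct children
        walk(child, depth + 1, rows)
    return rows

rows = walk(ET.fromstring(data))
for depth, tag, attributes, text in rows:
    print("  " * depth + tag, attributes, repr(text))
```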

Parsing XML:

Navigating the tree structure to access specific data points using paths (e.g., /a/b/text).
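
As a sketch of path-based navigation, the small a/b/c tree from the lecture can be parsed and queried with ElementTree. Note that ElementTree paths are relative to the element you start from, so the document path /a/b is written "b" when starting at the a element.

```python
import xml.etree.ElementTree as ET

data = """<a>
  <b>X</b>
  <c>
    <d>Y</d>
    <e>Z</e>
  </c>
</a>"""

root = ET.fromstring(data)                 # root is the <a> element

text_at_a_b = root.find("b").text          # document path /a/b
text_at_a_c_d = root.find("c/d").text      # document path /a/c/d
text_at_a_c_e = root.find("c/e").text      # document path /a/c/e

print(text_at_a_b, text_at_a_c_d, text_at_a_c_e)
```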

Next steps:

Learn about XML Schema for defining valid data structures.
Explore tools and libraries for working with XML in various programming languages.

Remember:

XML is a powerful tool for sharing structured data efficiently and reliably.
Understanding its basic structure and principles opens doors to various data communication and manipulation applications.

# Welcome to the XML Tutorial!

Ready to unlock the power of structured data sharing? Let’s dive into XML!

In this tutorial, you’ll learn:

What XML is and why it’s awesome
Key elements and structure
How XML compares to HTML
How to create and parse XML documents
Using XML Schema for validation
Practical examples and hands-on exercises

Here’s our roadmap:

XML Basics:
- What is XML and why use it?
- Core components (tags, attributes, text, nodes)
- Tree structure and paths
- Similarities and differences with HTML
Creating XML Documents:
- Writing well-formed XML
- Choosing meaningful tag names
- Using attributes effectively
- Structuring data hierarchically
Parsing XML:
- Accessing data using different programming languages
- Navigating the tree structure with paths
- Extracting specific information
XML Schema:
- Defining valid structures for XML documents
- Ensuring data integrity and consistency
Practical Applications:
- Handling configuration files
- Exchanging data between web services
- Storing and organizing data
- Creating dynamic web content

Let’s get started!

<person>
  <name>Chuck</name>
  <phone type="intl">
     +1 734 303 4456
   </phone>
   <email hide="yes"/>
</person>

So “XML” is what we call
the Extensible Markup Language. And basically XML, any serialization
format has some special characters and then some rules about how to
form the serialized document, basically, from the internal structures. And so the rules of XML and part of
the reason that XML became popular is, it became popular at about
the time that HTML became popular. Or you could almost say
that XML influenced HTML. But the notion of less thans and
greater thans as the active characters, as the way to tag or
otherwise mark the information. And so, that’s how it works, and so it works just like HTML, and
there’s start tags and end tags. And that sort of brackets a chunk
of stuff, and so people is a tag. And then person, and /person,
there, that’s another tag, an ending tag. And then name and /name,
that’s an ending tag. And the way to think about this
is there is a simple element. These are called elements or nodes, and we’ll have a couple of ways
of visualizing these coming up. There is one kind of element that
just has some text in between. And so that is the simplest bit,
it’s called the simple element. And then another kind of element like
this person actually just has, sort of, child nodes associated with it. And so the simple elements are these, and then the complex elements are person and
people. And so they’re nested together. And the indenting is just something I’m
showing you to make it look pretty. XML doesn’t really care what
extra spaces you put in, but it certainly helps us as human
beings to understand what’s going on. And so the primary purpose of
XML is to share structured data. It was a simplified subset of this SGML, which kind of was the precursor
to both XML and HTML. SGML was a little hard, so you could almost think of XML as like
a simplified easy version of SGML. And so here are sort of the basics of it.
It has a start tag and an end tag. So that’s what the start tag
and end tag are. Start tag, end tag, start tag,
end tag, start tag, end tag. Now, so that’s what start tags and
end tags, the end tags are the ones
that have the slash. There is textual content, that’s
there’s the stuff between the tags. So there’s just the text,
what you call the text nodes. And then, there are attributes, and
attributes are always on the start tags. So the phone and the email, and they are key-value pairs using
double quotes, as the type="intl". And the key thing about XML versus
HTML is we get to make up the tags. In HTML we say things like h1 or
a for the anchor tag, or h1 for header level one. Here, based on how we are going to agree
between the two applications exchanging data, you can say the tag is person and
/person. You still have to follow the rules,
though, there is a slash tag at the end. So attributes. And
then you can also have a self-closing tag. And that is, you just include this
/> at the end of the open tag. And it’s as if you have a closing tag
of the same name with an empty text area. And so that’s what this is saying,
so that’s a self-closing tag. Okay, like I said the whitespace
doesn’t really matter so much. We can pull these things up,
we tend to indent them like any kind of programming
environment to help our own reading. But, in general, whitespace doesn’t matter, except in the middle of things like these
text areas, where the whitespace does matter. But, sort of, between here and here,
the whitespace doesn’t matter. Between here and here,
the whitespace doesn’t matter. So the whitespace, these extra
spaces that I used to indent it, it doesn’t really matter. The only time it matters is in between,
when you’re in a text area. Here’s some sample XML,
just to give you some ideas. Here we have,
there’s always one big outer tag of XML, you have the start tag and an end tag. And there’s only one of those, because
it can’t be, sort of, multiply defined. There’s always the outer tag.
We see a series of attributes. Right? So the attributes are key equals and then double-quoted string,
key equals double-quoted string. And if you look at the HTML, you’ll see
that this is exactly the same as HTML. The difference is in HTML you’re supposed
to have a thing called href="blah, blah, blah", right? And so HTML is kind of like XML
that’s more highly specified. Whereas this is just two programs
agree on a format and they use it. So we have an outer tag
that’s a complex element. And then we have, like in that title, you can have sort of things
that are in order, like this ingredient. You see some attributes on there, and then you have a text block
in the middle of here. So we’ll see in a second how these
things all work. And then a sub tag. It’s like a tree, we’ll see that
in a second, and a series of steps. And so these can be in order,
they can be more than one of these things. And we can create all kinds of structures
that are really designed based on the needs of our working two applications
that are trying to cooperate. So tags are the beginning and the end. Attributes are key-value
pairs on the start tag. Serialization is the
act of taking data from an internal structure in one programming environment and
sending it across the network. Deserialization is receiving it
across the network and translating it back into an internal
structure on the destination computer. So there’s a couple different
ways that we can look at this. The most common one, and
the word nodes kind of comes from trees. Each of these is like a node,
because it sort of comes together. It’s a place of connection,
so we call this thing a node. And so you can think of the outer
document as the top node of the tree. It’s kind of an upside-down tree, actually. If this was a tree, it’d have a trunk. And then we just have stuff like this and
a squirrel sitting up here, right? But it’s kind of an upside-down tree. And so we have the top of the tree
here and then it has two child nodes, the b node and the c node
are directly beneath the a node. And then we model, as you’ll see in a bit why, we model
the text in between them as a child. So it’s a child of the b node, so
that’s the text is a text node, and it is a child of the b node. So the b node is all of this, and
then this is the child of the b node. And the c node has two children,
d and e. And the d has a child with capital Y and
e has a child of capital Z. So, these are the simple and
complex elements. And then there’s the text
within the elements. But, like I said,
we model the text as a child of the node, as you will soon see, right now. And that’s because we model
attributes as different children. So, if we change this a little bit, and
we make this have an attribute, w="5", all of this is part of the b node. And the b has the text area,
there’s only one of those, and there could be many attributes right?
There could be lots of attributes. I just have w="5", and so one of the children of the b node is
the attribute child or the text child. And there can be many of
these attribute children, because there could be lots of attributes. You know a="4", b="19",
they’ve always got to have double quotes. And so you could have many of these
attributes and they’re sort of children. But if you grab the node, and you’ll see when we start talking about
doing this in programming languages. We’ll see why it’s important
to kind of understand what it means when you grab this
versus when you grab that. Those aren’t the same thing. X is the text at the node b, and the node b is that text, and attributes,
and everything all rolled up together. Another way to think about these and a way to actually parse XML is
through what we call paths. In a sense what you do is you just draw
the tree and then you walk down the tree. And so this X could be thought of
as a piece of data that’s at /a/b. So you start at the top. Go down to a, go down to b,
what do you find there? If we go down here and go from a to c
to d, and find Y, that’s this one, /a/c/d. So this is like a path. And you can think of this like
folders on your computer. The a folder, then there’s two
folders within a which are b and c, the children folders. And then within c there is
children folders d and e. And so like this one here is /a/c/e and
then say, what’s living there at /a/c/e? The text that’s living at /a/c/e is Z. And so that’s another way to think about XML. Now the thing we’re going to talk about
next is an important aspect where we’re trying to decide
between two applications. If I’m producing data and
you’re consuming data, and you blow up, was it the data’s fault,
or was it your fault? And that’s what XML Schema does for us.
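
Two points from the transcript above, that whitespace only matters inside a text area and that grabbing a node is not the same as grabbing its text, can be sketched in Python with ElementTree. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

data = """<person>
  <name>Chuck</name>
  <phone type="intl">
     +1 734 303 4456
  </phone>
  <email hide="yes"/>
</person>"""

person = ET.fromstring(data)
phone = person.find("phone")         # the whole node: tag, attributes, and text together

raw_text = phone.text                # the text area, kept exactly as written
clean_text = raw_text.strip()        # whitespace trimmed by the program, not by XML
phone_type = phone.get("type")       # an attribute on the start tag
email_text = person.find("email").text   # a self-closing tag has no text at all

print(repr(raw_text))
print(clean_text, phone_type, email_text)
```

The indentation between tags never reaches the application as data, while the line breaks inside phone are preserved in raw_text until the program strips them.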

Video: 13.3 – XML Schema

Notes

Tutorial

Transcript

Summary of XML Validation with XSD:

Purpose: XML Schema (XSD) defines contracts for valid XML structure and data types. This avoids ambiguous interpretations and resolves disagreements between cooperating applications.
Structure: XSD is also an XML document with specific tags like xs:element, xs:complexType, xs:sequence, etc. to define allowed tags, data types, and their relationships.
Examples: We saw different XSD examples defining elements like person with required “lastname”, optional “county”, and specific countries allowed for “country”.
Data Types: XSD supports various data types like string, integer, date, time, etc. with specific formats like YYYY-MM-DD for dates and “T” for separating date and time.
Validation: XML documents can be validated against their corresponding XSD using validators. A valid document ensures data meets the contract and avoids issues during application exchange.
Further Learning: More complex XSD features like attributes, restrictions, and unbounded elements were briefly mentioned. We’ll move on to XML parsing in Python next.

Key Takeaways:

XSD provides a way to formally define acceptable XML structures and data types.
Validating XML against XSD ensures consistent data exchange and prevents application malfunctions.
There are various XSD features and data types with specific syntax and usage.

Here’s a tutorial on XML Validation with XSD:

Introduction

XML (Extensible Markup Language) is a flexible data format for storing and exchanging information.
To ensure consistent structure and data types in XML, we use XML Schema (XSD).
XSD defines a contract that XML documents must adhere to, preventing errors and misunderstandings.

What is XSD?

An XML Schema Definition (XSD) document is itself written in XML.
It defines the allowed elements, attributes, data types, and relationships within a valid XML document.
Common XSD tags include:
- xs:schema: Root element of an XSD document
- xs:element: Defines an element in the XML structure
- xs:complexType: Defines a complex element with child elements or attributes
- xs:sequence: Specifies the order of child elements
- xs:simpleType: Defines a simple type for text-only values, usually by restricting a built-in data type
- xs:attribute: Defines an attribute for an element

Key Features

Data Types: XSD supports various data types, including:
- string
- integer
- decimal
- date
- time
- boolean
- and more
Element Cardinality: XSD specifies how many times an element can occur using minOccurs and maxOccurs attributes.
Constraints: XSD can enforce constraints like:
- Enumerations (limiting values to a specific list)
- Other restrictions (limiting values by range, length, or pattern)
Attributes: XSD can define required and optional attributes for elements.
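
The idea behind minOccurs and maxOccurs can be illustrated with a small hand-written count check. This is only a sketch of the concept in plain Python, not an XSD validator; the function occurs_ok and the child_name tag are hypothetical, and max_occurs=None stands in for maxOccurs="unbounded".

```python
import xml.etree.ElementTree as ET

def occurs_ok(parent, tag, min_occurs=1, max_occurs=1):
    """Return True if tag appears among parent's direct children an allowed number of times.

    The defaults of 1 and 1 mirror XSD, where an element must appear exactly once
    unless minOccurs or maxOccurs says otherwise.
    """
    count = len(parent.findall(tag))
    if count < min_occurs:
        return False
    return max_occurs is None or count <= max_occurs

person = ET.fromstring("""<person>
  <full_name>Tove Refsnes</full_name>
  <child_name>Hege</child_name>
  <child_name>Stale</child_name>
</person>""")

print(occurs_ok(person, "full_name"))            # exactly one required
print(occurs_ok(person, "child_name", 0, 10))    # zero to ten allowed
print(occurs_ok(person, "child_name", 1, 1))     # two present, only one allowed
```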

Validation Process

Create an XSD document defining the structure and data types for your XML.
Use an XML validator to check if an XML document conforms to the corresponding XSD.
The validator reports any errors or warnings if the XML doesn’t match the XSD.

Benefits of XML Validation with XSD

Ensures data integrity and consistency
Prevents errors during data exchange between applications
Promotes data reusability and interoperability
Facilitates automated data processing

Example

XSD:

XML

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="lastname" type="xs:string"/>
        <xs:element name="age" type="xs:integer"/>
        <xs:element name="dateborn" type="xs:date"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Valid XML:

XML

<person>
  <lastname>Smith</lastname>
  <age>35</age>
  <dateborn>1987-03-21</dateborn>
</person>

Invalid XML (missing required element):

XML

<person>
  <age>35</age>
  <dateborn>1987-03-21</dateborn>
</person>
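
Python's standard library does not include an XSD validator, so the sketch below is a hand-written stand-in that checks the same contract as the example schema: a person element containing lastname (string), age (integer), and dateborn (date), in that order. The names PERSON_CONTRACT and check_person are assumptions made for illustration, and the date check ignores the optional time zone that xs:date permits. A real validator would read the XSD file itself.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

def is_integer(text):
    return re.fullmatch(r"[+-]?\d+", text) is not None

def is_date(text):
    # Require the fixed YYYY-MM-DD layout, then make sure it is a real calendar date.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is None:
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True

# Expected child tags, in sequence, each with a check for its text.
PERSON_CONTRACT = [
    ("lastname", lambda text: True),   # any string is acceptable
    ("age", is_integer),
    ("dateborn", is_date),
]

def check_person(xml_text):
    root = ET.fromstring(xml_text)
    if root.tag != "person":
        return False
    children = list(root)
    if [child.tag for child in children] != [name for name, _ in PERSON_CONTRACT]:
        return False
    return all(check((child.text or "").strip())
               for child, (_, check) in zip(children, PERSON_CONTRACT))

valid_xml = "<person><lastname>Smith</lastname><age>35</age><dateborn>1987-03-21</dateborn></person>"
invalid_xml = "<person><age>35</age><dateborn>1987-03-21</dateborn></person>"

print(check_person(valid_xml))
print(check_person(invalid_xml))
```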

So we have two cooperating applications,
and they’ve got to send data to one another, and they have a disagreement as to
whether or not the data is right. One side might blow up or
the other side might blow up and it’s like whose fault is it? And so it’s important to be able to define
a contract as to what is acceptable XML. That’s really kind of outside
of either application. You can’t say,
the XML that works is the XML. The correct XML is the one that
causes my program not to blow up. That’s not really the best way to define
it, so we have to define it kind of at the moment of data exchange,
and that’s what XML schema does. It’s a way to sort of establish outside of
any program, and then separately check. So this side can say I sent good stuff, and this side said I received good stuff or
I received bad stuff. So you can just look at a document and
you can say yes, this validates, or no, it doesn’t validate. And validation is not the act
of transferring the data or even deserializing the data. Validation is the act of verifying
that the data is in the right format. It’s a contract. So
if you’re working with something like your airline reservation system that’s
working with an airline system, I mean that’s working with a hotel system,
you might say okay, and here’s the schema. And they just publish that separately and
agree that’s the schema. So if later the XML starts to change
in bad ways that break one side or the other, they can know which
side it was that started changing. And so XML validation is the act of
taking a document and a Schema Contract, which itself is also an XML document, and
then sending to the Validator. We’re not going to spend
a lot of time with this, but it’s an important concept to sort of
imagine and understand how contracts between cooperating applications
have to be developed. And so here’s a bit of XML. So
some XML person tag, remember the outer is only a single tag,
and then I’ve got three child tags. You know last name, age, date born, and
then we have a contract. And so like I said the contract itself is
XML with a couple of weird-looking tags. And what the XML is trying to say,
you know, what kind of tag this is, how much data you’re
supposed to put in it. Can you have any child tags underneath it? All those kinds of things
are the questions that are being asked by the contract. And so this is xs:complexType.
As we’ve said before, elements that can have children are called complex
and elements that can’t are simple. And so this basically says the outer bit
of this particular XML is expected to be a tag named person and
that’s what that’s saying. Then what it says is within that
there’s going to be a sequence of tags, xs:sequence. And then we basically say oh, and there’s going to be a tag called
lastname and it’s going to be a string. And then there’s going to be a tag
that’s age and it’s going to be integer. And there’s going to be a tag called
dateborn and that’s going to be a date. And so we can sort of look at this and
say outer one person, next one in name, age, that’s a number,
that’s good, that’s a string, that’s good. That looks like a date, that’s good. Check, check, check, check, check. This XML matches that contract and
that’s the idea. Now, by the time you
actually have to make one of these, you’ll have to read all
kinds of stuff and figure all this out. So I just want you to get the sense
that these things exist and they’re not impossible to read. And so it turns out again,
in the early 90s, there was a number of these schema languages that
were out there and you’ll run across them. I’m mostly talking about the one that
kind of won or the one you’re most likely these days to encounter called the XML
Schema from the World Wide Web Consortium. It’s called XSD and
usually in the file that you get, if I just have a file and I send you
the XML I have a suffix of .xml, and if I send you a schema, I tend to send you
a file that’s .xsd and so we kind of just call it XSD and that’s the one we’re
going to talk about and so away we go. That’s the one.
There are others. I just want you to know that
there’s other ones out there. So like I said. You have complex elements, you have
simple elements, you have a sequence. Those are sort of the basics of
the tags that you put into XSD. You can do further things, I just
gave you the simplest kind of stuff. And so, for example, I can basically say,
okay, this tag, full string? I want there to be a thing called
full_name, there’s a sequence. So this is sequence,
we can see that this is the sequence. And I want this to be minOccurs
equals 1, maxOccurs equals 1. That basically means there’s
going to be one and exactly one. One and exactly one. MinOccurs if you have less than one,
it’s an error, if you have more than one, it’s an error. This one here, this tag. This one can happen 0 times up to 10. So that just means if there is somewhere
between zero and ten of those tags between full_name and the end of person, we are
happy with this. And so that minOccurs and maxOccurs is basically how many
times these things must appear. So again this is just a little more sophisticated XSD. We’ve already played with a couple
of different kinds of data types. We’ve talked about string,
we’ve talked about date. We’ll look at the date
in a little more detail. Date, time, decimal, and integer. So it sort of understands the difference
between a floating point and an integer. So you can render an opinion
as to what you want in there. String, of course, is just about
anything and dates are kind of special. The date format that they chose, of
course if you go from different countries the date is all kinds of different things. The date format that they’ve chosen that’s
heavily used on the Internet is a date format that’s year four-digit year, dash,
two-digit month, dash, two-digit day. And that’s not how it is in America. We do 9/24/2002 in America. But in the web,
they basically said how about we pick. We didn’t want to pick the one
that was the most popular but pick something that computers would like. So this turns out by forcing it
to be fixed like this, it sorts. So if you were to sort this,
the year is the most significant. The month is second most significant. And the day is the next most significant. And you zero fill up to two digits for
the month and two digits for the day. And so January 1, 2001
is 2001-01-01 and so it’s the same length as 2002-12-31. And so you can just sort these and
it just sorts as strings. They sort quite naturally, whereas lots
of the formats that we use for dates in common use or in our own writing or on
checks or whatever, don’t sort so well. So we did this for computer folks, got
a picture of this coming up in a second. And the other thing is
if it’s a date-time, it’s exactly the same date format,
and then it’s the letter T, and then hours, minutes, and
seconds, and then a time zone. Dates and times in the web and
on the Internet are very problematic. Because in the real world where the sun
comes up and the sun comes down, and it matters what time of day, you kind of
want it to be noon when it’s light out. We have time zones. And then we have savings time. And then certain places violate the rules,
and they are a different time than
somebody that’s right next door. And so a thing that computers tend to do,
is they tend to ignore the time zone. I remember in the old days, computers would mess up when
daylight savings time happened. And they don’t anymore, because they tend
to think of all times inside the computer as Universal Time or Greenwich Mean Time. And that means it could even be
a different day, but then they offset it. And if you’ve ever traveled and
your calendar sort of switches, that’s because the time of your items
is the same inside the calendar. But then when your local time switches,
it just moves forward or back, six hours forward or
three hours back or whatever. And so, that’s why we tend. Now, these time formats can have time zones
in them, but it’s not highly recommended. And so we tend to see these Z times,
which are the UTC, and like I said it is filled out with 0’s so that
all the columns have to be filled out. Four-two-two with the dashes and that means
they’ll all have the same length and the letter T and then two digit two
digit two digits with the colons and then at the end the time zone. And like I said we tend to do
everything in absolute time. Here are some more XSDs. Let’s take a look at this one. Some of the documents have this little
xml that’s really an indication that it’s an XML document and then we have
the schema which is the outer one. There’s an address as the outer tag, and that tag goes all the way to here
because these are just key-value pairs. We have a recipient which is a string. Recipient, house, that’s a string I guess,
street which is a string. And then we have post code, county. What have we got here? Town, County, oh County, that’s optional. minOccurs 0, so you’ll notice
that there is no county here, because that’s minOccurs 0,
there’s no county. And then if we take a look at
post code, post code is a string, so that’s a string. And then we have country, so
country’s an interesting one, so this is all about country right here. So country is a string, but we’re going
to restrict it with this enumeration, it says it has to be one
of these five strings. So it can’t be just anything,
it has to be one of these five strings. In this particular case, we can validate that this UK is
indeed one of those valid strings. So this validates. So this bit here validates
everywhere we can check and validate that every one of the tags there
meets the needs of this XML schema. Here’s another schema that
has a couple of other things. xs:string, we’ve got that.
string, we’ve done that. maxOccurs="unbounded". That says as many as you want. minOccurs="0", we know
how to deal with that. xs:positiveInteger, that just
means your -14 is not allowed. And then use="required",
it says this has to be there. And you can talk about attributes as well. So you basically say you must have an
orderid attribute on this particular tag. Okay? And so I’m not trying to teach you
how to be an XSD wizard, just the notion that there is this syntax
that’s used to establish a contract so that you can resolve disagreements
between cooperating applications. So up next,
we’ll switch from schema back to XML and we’ll look at how to parse XML in Python.

Video: 13.4 – Parsing XML

Notes

Tutorial

Code

Transcript

Here’s a summary of the key points about parsing XML in Python:

1. Importing the Library:

Use import xml.etree.ElementTree as ET to import the necessary library.

2. Parsing XML Data:

Use tree = ET.fromstring(xml_string) to parse the XML string into a tree structure.

3. Accessing Elements and Attributes:

Single Elements:
- tree.find('tag_name').text to get the text content of a single element.
- tree.find('tag_name').get('attribute_name') to get the value of an attribute.
Multiple Elements:
- tree.findall('tag_name') to get a list of all elements with the specified tag name.

4. Looping Through Elements:

Use a for loop to iterate through a list of elements:
- for item in lst: # Where lst is the list of elements
- # Access elements and attributes within the loop

5. Important Points:

XML data is structured as a tree.
Use find to find a single element and findall to find multiple elements.
Access text content with .text and attribute values with .get('attribute_name').
Use loops to process multiple elements.
Schemas are important for ensuring data consistency and preventing errors.

Here’s a tutorial on parsing XML in Python:

1. Importing the Library:

Begin by importing the xml.etree.ElementTree library, which provides tools for parsing XML:

Python

import xml.etree.ElementTree as ET

2. Parsing XML Data:

Load the XML data into a tree structure for easy navigation:

Python

xml_string = """
<your_xml_data_here>
    </your_xml_data_here>
"""

tree = ET.fromstring(xml_string)

Replace your_xml_data_here with your actual XML content.

3. Accessing Elements and Attributes:

Single Elements:
- Find a single element by its tag name and access its text content:

Python

name = tree.find('name').text  # Example: Retrieves the text within the <name> tag

- Get the value of an attribute within an element:

Python

email_hide = tree.find('email').get('hide')  # Example: Retrieves the value of the 'hide' attribute in the <email> tag

Multiple Elements:
- Find all elements with a specific tag name and store them in a list:

Python

users = tree.findall('user')  # Example: Retrieves all <user> tags

4. Looping Through Elements:

Use a for loop to iterate through a list of elements:

Python

for user in users:
    user_id = user.find('id').text
    user_name = user.find('name').text
    print("User ID:", user_id)
    print("User Name:", user_name)

5. Additional Tips:

Namespaces: When the XML declares a namespace, ElementTree matches tags by the full namespace URI written in braces before the tag name (not by the prefix used in the document):

Python

tree.find('{namespace_uri}tag_name')
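
As a small illustration of namespace handling, here is a sketch with a made-up document. ElementTree matches on the namespace URI in braces, or on a prefix that you map to that URI yourself. The feed document, the URI http://example.com/ns/feed, and the prefix f are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that declares a default namespace.
ns_xml = """
<feed xmlns="http://example.com/ns/feed">
  <title>Sample</title>
</feed>
"""

root = ET.fromstring(ns_xml)

# Without the namespace, the tag is not found.
print(root.find("title"))  # prints None

# Option 1: write the namespace URI in braces before the tag name.
title_a = root.find("{http://example.com/ns/feed}title").text

# Option 2: map a prefix of your choosing to the URI and pass the mapping.
ns = {"f": "http://example.com/ns/feed"}
title_b = root.find("f:title", ns).text

print(title_a, title_b)  # prints Sample Sample
```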

Error Handling: Use try-except blocks to catch potential parsing errors.
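
A minimal sketch of that error handling follows. ET.fromstring raises ET.ParseError when the input is not well-formed XML; the helper name parse_or_none and the broken sample string are made up for this example.

```python
import xml.etree.ElementTree as ET

# The second <name> should be </name>, so this is not well-formed XML.
broken = "<person><name>Chuck<name></person>"


def parse_or_none(xml_text):
    """Return the parsed root element, or None if the XML is malformed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        print("Could not parse XML:", err)
        return None


tree = parse_or_none(broken)   # prints a message and returns None
good = parse_or_none("<person><name>Chuck</name></person>")
print(good.find("name").text)  # prints Chuck
```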

Example:

Python

import xml.etree.ElementTree as ET

xml_data = """
<users>
    <user x="2">
        <id>001</id>
        <name>Chuck</name>
    </user>
    <user x="7">
        <id>009</id>
        <name>Brent</name>
    </user>
</users>
"""

tree = ET.fromstring(xml_data)

users = tree.findall('user')

for user in users:
    user_id = user.find('id').text
    user_name = user.find('name').text
    user_x = user.get('x')
    print("User ID:", user_id)
    print("User Name:", user_name)
    print("Attribute X:", user_x)

This code outputs:

User ID: 001
User Name: Chuck
Attribute X: 2
User ID: 009
User Name: Brent
Attribute X: 7

import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
     +1 734 303 4456
   </phone>
   <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

# Code: http://www.py4e.com/code3/xml1.py
# Or select Download from this trinket's left-hand menu

Name: Chuck
Attr: yes
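
As a small extension of the example above (not part of xml1.py itself), the sketch below reads the phone element. Its text includes the newlines and spaces that surround the number in the XML, so calling strip() is a common clean-up step, and a self-closing tag such as email has no text at all.

```python
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
     +1 734 303 4456
   </phone>
   <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)
phone = tree.find('phone')

raw = phone.text       # includes the surrounding newlines and spaces
clean = raw.strip()    # just the number

print(repr(raw))
print('Phone:', clean)                 # prints Phone: +1 734 303 4456
print('Type:', phone.get('type'))      # prints Type: intl
print('Email text:', tree.find('email').text)  # prints Email text: None
```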

So now we’re going to move into writing
code in Python to deal with XML. Now, it’s not too difficult because
like most of the things we do in Python, the first thing we do is a really clever
import statement that does most of the work for us. So this is importing a library
xml.etree.ElementTree and this ET then becomes, that’s an alias. The syntax of as is like an alias. It ends up being a short form so
we don’t have to type this long thing. Now we’re going to, normally we would be
reading all of these data with urllib and read and whatever and
then we would parse it. But just to make these simple
on one screen I’ve kept it simple. And so I have a string. Now this is a new syntax that you
haven’t seen before, probably, and that’s the triple-quoted string. So a triple-quoted string in Python
is a potentially multi-line string. And so that’s the beginning of the string. The string ends down here. The newlines that are here
are part of the string. Okay, so this is as if we read this bit of stuff from here to here
in from a file or in from the web. So this is just my way of
emulating like a urllib and then a read so we can just
look at it all in one screen. So here’s our XML. And you see that it’s well-formed XML. We’ve got a beginning tag and
an ending tag, being it’s the same stuff
that I’ve been doing. So we have to parse it. And this is kind of like what we
do with HTML and Beautiful Soup. We have to pull this string data and
give ourselves an object back and then work with that object. And so we take this string data,
we pass it in to ET.fromstring. And what fromstring says
is take this string and give us back basically a nice tree. So to think back to those tree pictures,
give us back these trees and make sense of it. It’s still got the same thing like Chuck and
the phone number 303. All the stuff. They’re all in there, right? It just has kind of constructed
this as an internal memory structure inside of Python. And that’s what we get back
from ET which is this, that goes into this tree
variable right here. So we got this tree of information
that’s properly parsed. Now this could blow up. This could traceback. If you have a syntax error like you
didn’t put the slash in or something, this would fail. Say you got bad HTML or bad XML. And so that’s kind of what you got to do. But when it’s all said and done,
if this line of code succeeds, then you have good XML and
you can make sense of it, okay? And so what we could do is we can say within that XML data,
go find me the tag name. So that basically is this,
tree.find('name') finds me that. So if you think of this in a little
picture, there is the tag named name. And then remember the child tag was Chuck. And so the whole tag is this. And then to get down and
get just this Chuck bit, we say .text. So this is this. And it’s also that. So if you want to get the text
that’s in between the name tag and the end name tag, you say tree.find,
tag name, and then .text. And that text is an attribute of this
particular node give me back that thing. And if you want to get an attribute,
not the text node, you can say, okay, go find me the email,
which is this, which is this. So if you look at the email, it looks
like this, email is the node, and it has an attribute of hide,
and what, yes. And there is no text, right? Because this is a self-closing tag so
this doesn’t exist. So we say tree.find('email'). That gets us this whole thing, and
then we call the get method within it and say, get('hide'). And that says go find me the attribute
named hide within the tag email and so that then gives us back yes. So this whole expression, tree.find
email .get hide, gives us yes. And so that allows us to work our
way down in through some XML and pull stuff out of the XML. And that’s what we have to
do when we’re in a program. Okay? And so that’s the syntax for parsing XML. And if you go online you’ll see lots and lots of examples of how
to pull data out of XML. When I’m writing code like this I tend to
have to print out tree.find('email') and then I get a few things. So these expressions tend to kind of
get long as you’re working your way down a tree of XML. And then you find a thing. So, don’t expect that you necessarily can
write this code perfect the first time. You sort of write a little bit,
then add a little bit more, then add a little bit more,
then add a little bit more, and then finally you see the thing that you want
to get out of your tree of data. So you can have either a tag, which is
sort of a simple tag that has a child, or you can have a tag
that has multiple tags. And so, we use a different way if
there are multiple child tags. So, here we have, again, the single, the triple-quoted technique where the
bigger outer tag is stuff, and there is a users tag below that, and
then there is a number. So the idea here is we have
many dot dot dot dot dot. Many users. User, user, user, we have
x equals 7, an id, and each user has a little bit of data,
etc., etc., etc. And so, now we want to be able to
write code that’s going to go through each of these user tags. And so we’re going to use the
findall method. And again, we take all of the text,
we pass it into fromstring, and we get back an object. stuff is a tree of information that’s parsed and gives us methods and attributes
that we can use to go through the data. So that’s what this does. We take it from the outside world
to the inside world in Python. Now, what we’re going to do
is we’re going to say oh, okay, we’re going to call
the findall method in there. And we’re going to search for the users
tag, all of the user tags below users. So that’s what findall means, says, there’s a bunch of these under users,
there’s user tags. Find all of them and
then give them all to me. So what you basically get is these tags,
except in a list. Right? So these tags are in a list. Not just the word Chuck and
Brent, but the whole tag. In a sense, it’s a list that’s
itself little trees, right? That’s a tree. That’s a tree. So this is a list of tags, which are trees
with little mini trees of information. And so this is a list, it’s not a list
of strings, it’s a list of tags. But we can ask how long is it? So we’ll print how many there are,
and then we can loop through them. There’s just a little list, right? So it’s a list of two things, it has
a little tree here and comma little tree, little tree, so
that’s what we’re going to do. We’re going to write a for loop
to go through that, and we’re going to have an item that’s going
to iterate through each of these things. I can call that tag, for tag in
the list of tags, that would make sense. And so item is going to take on
the successive values of this list. This for loop is going to run twice. It’s going to run once for
this and once for that. And that’s what’s going on. And lst is, lst is that data structure. So now when we come in here we’re
going to have a little tag. And so the first user tag looks
like this with an x equals 2 and then a child tag of Id and then a child tag of name and the first
one is going to be 001 and Chuck. So item is going to point to this. So we can then do those
same kind of things. We can say within item find the name
tag so that grabs this bit out and then go grab you the text. So that this bit prints out Chuck. The same thing for go find the Id tag. Go find the Id tag and
then go grab the text field so that’s going to print out 001 and
we can go get the item and then there is the attribute
that is directly under it. And then the loop runs again and so
it’s now pointing at this bit right here. So item is now pointing to that tag. And it says go find the name tag and
find the text. So find me the name tag and
then find me the text within the name tag. And that prints out,
then it runs this line. Go find me the Id tag. Go find the Id tag. Find the Id tag and
then grab the text out of that, so 009 is going to print out there. And then from the original
tag go find the x attribute. That’s what item.get(“x”) is. And so that’s going to be 7. So that’s what we’re going to get there. So that’s going to pull that out. So in those two examples, I’ve shown you
how you sort of dig through a tree or loop through a list of trees. So that was a list of trees. Tree, tree,
those are the two basic things that you tend to do when you’re parsing XML. Is either cruise down a tree or get a list
of trees and then cruise down those trees. And sometimes you have
lists within lists and trees within trees, and
it can be very complex. And your programs can get complex but
sooner or later you get them working and
this is why the schema is so important. Because once your code works and
if they change that structure your code tends to blow up badly and so
you want to yell at the other person. Say like, why did you change the XML? And they’re like,
well I didn’t change the XML. And you’re like wait, here is the schema
that proves that you changed the XML. That kind of gets us through XML, and
there’s a lot of challenges to XML, XML is a very rich way to serialize data. Up next, we’re going to talk about a more
lightweight way to store data called JSON, or JavaScript Object Notation.
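
The point about code blowing up when the XML structure changes can be seen in a short sketch. Here a hypothetical sender has renamed the name tag to fullname; the changed document is made up for illustration. find returns None for a missing tag, so asking for .text fails, while findtext with a default is one way to fail more gently.

```python
import xml.etree.ElementTree as ET

# Hypothetical change: the sender renamed <name> to <fullname>.
changed = "<user x='2'><id>001</id><fullname>Chuck</fullname></user>"
user = ET.fromstring(changed)

# find() returns None when the tag is not there.
node = user.find('name')
print(node)  # prints None

# Asking None for .text is what produces the traceback.
try:
    user.find('name').text
except AttributeError as err:
    print('This is the blow-up:', err)

# findtext() with a default avoids the exception.
name = user.findtext('name', default='(missing)')
print(name)  # prints (missing)
```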

Video: Worked Example: XML (Chapter 13)

Notes

Tutorial

Code

Transcript

Summary of “Python for Everybody: Using XML in Web Services”

The video introduces working with XML data in Python using the ElementTree library. Here are the key points:

1. XML Basics:

XML is a structured format for storing data, often used in web services.
The video uses sample XML strings for demonstration.

2. Parsing XML with ElementTree:

ElementTree is a built-in Python library for parsing XML.
The fromstring method converts an XML string into a tree object.
The tree object represents the nested structure of the XML data.

3. Accessing Data in the Tree:

You can use .find and .findall methods to find specific tags in the tree.
.text attribute retrieves the text content within a tag.
.get method retrieves the value of an attribute on a tag.

4. Looping through XML Elements:

You can use loops to iterate through a list of tags returned by .findall.
Each element in the loop can be accessed and manipulated using the same methods as individual tags.

5. Example Code:

The video demonstrates two example Python scripts (xml1.py and xml2.py) using ElementTree to parse sample XML data.
The scripts extract text data and attribute values from the XML.

6. Conclusion:

The video provides a basic introduction to working with XML data in Python using ElementTree.
Further exploration of the library and practical applications in web services are encouraged.

Tutorial: Using XML in Web Services with Python

This tutorial introduces you to the fundamentals of using XML in web services with Python. We’ll explore how to parse XML data, access its content, and interact with web services that utilize XML communication.

Prerequisites:

Basic understanding of Python programming
Knowledge of web services concepts (optional)

1. XML Basics:

XML (eXtensible Markup Language) is a structured format for storing data used for sharing information between applications and platforms.
It has nested tags to define data elements and attributes, making it human-readable and machine-processable.

2. Parsing XML with ElementTree:

Python’s built-in ElementTree library simplifies XML parsing.
Use fromstring(xml_string) to convert an XML string into a tree object representing the data structure.
Access specific elements within the tree using methods like .find(tag_name), .findall(tag_name), and path expressions.
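
To make the path expressions concrete, here is a brief sketch using the same users document as xml2.py. It relies on the limited XPath subset that ElementTree supports; the variable names are chosen only for this example.

```python
import xml.etree.ElementTree as ET

stuff = ET.fromstring('''
<stuff>
    <users>
        <user x="2"><id>001</id><name>Chuck</name></user>
        <user x="7"><id>009</id><name>Brent</name></user>
    </users>
</stuff>''')

# A bare tag name only matches direct children, and user is not one of them.
direct = stuff.findall('user')
print(len(direct))    # prints 0

# A path walks down one level at a time.
by_path = stuff.findall('users/user')
print(len(by_path))   # prints 2

# .// searches at any depth below the starting element.
anywhere = stuff.findall('.//user')
print(len(anywhere))  # prints 2

# A predicate selects by attribute value.
brent = stuff.find(".//user[@x='7']/name").text
print(brent)          # prints Brent
```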

3. Accessing Data in the Tree:

Extract text content from a tag using .text attribute.
Retrieve attribute values with the .get(attribute_name) method.
Loop through lists of elements using standard Python loops.

4. Interacting with Web Services:

Web services exchange data using approaches such as SOAP (an XML-based protocol) and REST (an architectural style, usually over HTTP).
Many libraries like requests and zeep facilitate sending and receiving XML data to/from web services.
Use these libraries to build your client applications that interact with web services through XML-based communication.

5. Example:

Let’s consider a hypothetical weather web service (the URL below is a placeholder, not a real API) that provides temperature data in XML format. Here’s an example Python script utilizing requests and ElementTree to retrieve the current temperature:

Python

import requests
from xml.etree import ElementTree as ET

# Define the web service URL
url = "https://weather.example.com/current?city=London"

# Send a GET request and retrieve the XML response
response = requests.get(url)

# Check for successful response
if response.status_code == 200:
    # Parse the XML response
    root = ET.fromstring(response.content)

    # Find the temperature tag (find returns None if the tag is absent)
    temperature_tag = root.find("temperature")

    if temperature_tag is not None:
        # Extract the text value (temperature) and print it
        current_temperature = temperature_tag.text
        print(f"Current temperature in London: {current_temperature}°C")
    else:
        print("No temperature tag found in the response")
else:
    print(f"Error retrieving weather data: {response.status_code}")

6. Practice and Resources:

Explore web services with available public APIs that return XML data.
Practice building more complex applications that parse and process XML data received from web services.
Refer to the official ElementTree documentation and web service libraries’ documentation for more detailed information and advanced features.

Remember, this is a basic introduction. Further exploration and practice will help you master using XML in web services with Python effectively.

import xml.etree.ElementTree as ET

input = '''
<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Chuck</name>
        </user>
        <user x="7">
            <id>009</id>
            <name>Brent</name>
        </user>
    </users>
</stuff>'''

stuff = ET.fromstring(input)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get("x"))

# Code: http://www.py4e.com/code3/xml2.py
# Or select Download from this trinket's left-hand menu

User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7
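
One possible next step after a loop like the one above is to turn the tags into ordinary Python data, which is the deserialization half of the wire-format idea. The sketch below builds a list of dictionaries; the dictionary layout and the conversion of x to an integer are choices made for this example, not something xml2.py does.

```python
import xml.etree.ElementTree as ET

xml_text = '''
<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Chuck</name>
        </user>
        <user x="7">
            <id>009</id>
            <name>Brent</name>
        </user>
    </users>
</stuff>'''

stuff = ET.fromstring(xml_text)

# One dictionary per <user> tag; ids stay strings so leading zeros survive.
users = [
    {
        'id': item.find('id').text,
        'name': item.find('name').text,
        'x': int(item.get('x')),
    }
    for item in stuff.findall('users/user')
]

print(users)
```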

Hello, and welcome to Python for Everybody. I’m Charles Severance, I’m the author of the textbook, and we’re going to do a little bit of code. If you want to get your hands on the code, go to the materials website, it’s actually materials.php, and download the sample code. The code that we’re going to work on today is the XML code, and we need to be able to talk XML to work with web services. So, here’s one of the examples from the book, it’s xml1.py. Later we’ll be pulling XML and JSON from the web, but for now we’re going to put it in a triple-quoted string, so data, and we’re going to use a built-in XML parser in Python called ElementTree. When we say import xml.etree.ElementTree as ET, this as ET gives us basically a shortcut handle for it.
So, the idea: this is a string that has less-thans and greater-thans. It looks like structured information, and it is, but really at this point, it’s only a string. Now, we have to call this ET.fromstring to read this and give us back a tree object, and this might blow up. This code might blow up right here if there was a mistake in it. Matter of fact, I can probably put a mistake in. Let’s see if I can delete this, and save it, and run this code, and we’ll see that it will blow up, right? So, it blew up here. ElementTree blew up, I mean it blew up in line 12 of the code, which is right here, and this failed because line eight of the XML string was wrong. So, let’s put the slash back in. So, now it’s properly formed XML. So, this tree we get back, I name it tree just because I always name it tree, but you could name it x. So, the key is that tree.find goes and looks for a tag named name in it. The tree no longer has less-thans and greater-thans in it; it has turned these into objects within objects within objects.
So, tree.find('name') says, “I would like to find the tag name,” and that’s what this bit is right here, and then .text is going within that and grabbing that text. If we say tree.find('email'), then that’s going to give us this, that object, and then .get asks for the contents of the hide attribute, which is the string yes. So, if we run this, now that it’s fixed, python3 xml1.py, it will pull in and get the name and the attribute. So, it pulled the Chuck out, and so you get this object, and then you dive into that object. So, that’s xml1.py. If you’ve got a tag, you can either get the text out of the tag or you can get an attribute out of a tag.
So, now, let’s take a look at xml2.py. Again, we import ElementTree, and we have a tag, and XML’s always got to have a single outer tag. But this time we’re going to have, in effect, a list. Now, let’s line this up a little better. There we go. That looks a little prettier. So, the fact that it’s called users doesn’t mean anything, but we often come up with semantically meaningful names for these things. Users is going to have as its children a list of user tags. So, there are children under users, user and user, and then each of these has tags of its own. So, we want to parse this, and this is a common thing we want to do. So, again, the first thing we do is we read the string. It’s a triple-quoted string going from here to here. Then, instead of doing find, which gives us one tag, we’re going to findall the user tags that are children of users, and we get back a Python list of the tags, not of the texts, but of the tags. So, there’s one tag, and there is another tag. So, you do len of that, so we can see that we got two. Then, we can write a for loop, and this item is going to iterate through the tags that are the user tags, that are children of users. So, the first time, item is going to be this tag, a tag, remember.
Then, the second time, it’s going to be this tag. So, we can do things like find and get, just like we did in xml1. So, running this is not too exciting, python3 xml2.py. You see that there are two users, and that comes from this print right here. There are two users in there, and for the first one, if we go into name, and we go find the text within the name tag, within user, then we get Chuck, and then we get the id, which is 001: we find the id within that item, and then we get the text, and then we look, and we grab the x attribute off of that. So, we see Chuck, 001, and 2. Then, for the next tag, the for loop continues, and we print that out. So, that’s just a basic run-through of the XML from the chapter in the Python book. Okay. Thanks.

Assignments

Quiz: eXtensible Markup Language

What is “serialization” when we are talking about web services?

The act of taking data stored in a program and formatting it so it can be sent across the network.

The correct answer is:

The act of taking data stored in a program and formatting it so it can be sent across the network.

Serialization in the context of web services refers to the process of converting complex data structures, like objects or arrays, into a format that can be easily transmitted over a network. This often involves converting the data into a human-readable format like JSON or XML, or a binary format like Protocol Buffers.

Here’s why the other options are incorrect:

Sorting all the data stored in a tuple: This is a specific operation on a particular data structure (tuple) and not related to sending data across networks.
Marking each network packet so it can be put back into order on the receiving system: This describes sequencing, which is part of the network communication process but not the same as serialization.
Making it so that dictionaries can maintain their keys in sorted order: This is a specific optimization for dictionaries and not directly related to web services or data transmission.

So, when talking about web services, serialization is all about preparing data for its journey across the network, ensuring it can be understood and reassembled on the other side.
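
As a rough illustration of that round trip, the sketch below serializes a Python dictionary to an XML string and then deserializes it again with ElementTree. The person dictionary and the choice of one child tag per key are assumptions made for the example; real services define their own layout.

```python
import xml.etree.ElementTree as ET

person = {'name': 'Chuck', 'phone': '303 4456'}

# Serialize: build a small tree, one child tag per dictionary key.
root = ET.Element('person')
for key, value in person.items():
    child = ET.SubElement(root, key)
    child.text = value

# encoding='unicode' returns a str rather than bytes.
wire = ET.tostring(root, encoding='unicode')
print(wire)

# Deserialize: parse the string and rebuild the dictionary.
restored = {child.tag: child.text for child in ET.fromstring(wire)}
print(restored == person)  # prints True
```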

Which of the following are not commonly used serialization formats?

HTML, TCP, Dictionaries

Of the listed options, Dictionaries, HTTP, and TCP are not serialization formats. Let’s break down why:

JSON and XML: Both are widely used, human-readable serialization formats. They represent data structures like objects and arrays in a clear and concise way, making them popular for exchanging data between web services and APIs.
Dictionaries: While dictionaries are data structures commonly used in programming languages, they aren’t serialization formats. They organize data within a program but wouldn’t be used to encode it for transmission across a network.
HTTP: HTTP is a communication protocol, not a data format. It defines how information is exchanged between systems but doesn’t specify how the data itself is structured. JSON and XML are often sent within HTTP requests and responses, but they are separate entities.
TCP: Like HTTP, TCP is a protocol rather than a data format. It is a transport protocol that ensures reliable data delivery across networks, focusing on packet transmission and reassembly, not data formatting.

Therefore, only JSON and XML are serialization formats commonly used in web services communication; Dictionaries, HTTP, and TCP are not.

In this XML, which are the “simple elements”?
<people>
<person>
<name>Chuck</name>
<phone>303 4456</phone>
</person>
<person>
<name>Noah</name>
<phone>622 7421</phone>
</person>
</people>

name, phone

The simple elements in the given XML are:

name
phone

Simple elements in XML have the following characteristics:

Contain only text: They don’t have any nested elements or attributes.
Can’t have children: They can’t contain other elements within their start and end tags.

In the provided XML, the name and phone elements meet these criteria:

They hold only text content (e.g., “Chuck” and “303 4456”).
They don’t contain any other elements within them.

In contrast, the people and person elements are considered complex elements because they have child elements within their structures.
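
A short sketch of how a program could tell the two kinds apart follows. It uses "has no child elements" as the test for a simple element, which matches this example (none of the tags here carry attributes); the set names are made up for illustration.

```python
import xml.etree.ElementTree as ET

people = ET.fromstring('''
<people>
  <person>
    <name>Chuck</name>
    <phone>303 4456</phone>
  </person>
  <person>
    <name>Noah</name>
    <phone>622 7421</phone>
  </person>
</people>''')

simple_tags = set()
complex_tags = set()

# iter() visits the root element and every element below it.
for elem in people.iter():
    if len(elem) == 0:       # len() of an element is its number of children
        simple_tags.add(elem.tag)
    else:
        complex_tags.add(elem.tag)

print(sorted(simple_tags))   # prints ['name', 'phone']
print(sorted(complex_tags))  # prints ['people', 'person']
```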

In the following XML, which are attributes?
<person>
<name>Chuck</name>
<phone type="intl">
+1 734 303 4456
</phone>
<email hide="yes" />
</person>

type="intl" within the phone element
hide="yes" within the email element

In the given XML, the attributes are:

type="intl" within the phone element
hide="yes" within the email element

Attributes in XML provide additional information about elements. They follow these key characteristics:

Appear within the opening tag of an element: They are placed within the element’s start tag, after the element name.
Have a name-value pair structure: They consist of a name followed by an equal sign (=) and a value enclosed in quotes (either single or double).

In this example:

The phone element has an attribute named type with a value of intl, indicating that it’s an international phone number.
The email element has an attribute named hide with a value of yes, suggesting that the email should be hidden.
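
In ElementTree, each element exposes its attributes as a dictionary named attrib. The sketch below lists every attribute in the XML from this question; the variable name found is chosen only for the example.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('''
<person>
<name>Chuck</name>
<phone type="intl">
+1 734 303 4456
</phone>
<email hide="yes" />
</person>''')

# Collect the attributes of every element that has at least one.
found = {elem.tag: dict(elem.attrib) for elem in person.iter() if elem.attrib}

print(found)  # prints {'phone': {'type': 'intl'}, 'email': {'hide': 'yes'}}
```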

In the following XML, which node is the parent node of node e
<a>
<b>X</b>
<c>
<d>Y</d>
<e>Z</e>
</c>
</a>

node c

The parent node of node e is node c.

In XML, the parent-child relationship between nodes is defined as follows:

A parent node directly contains another node within its structure.
A child node is directly enclosed within the start and end tags of its parent node.

In the given XML, node e is nested within the start and end tags of node c, making node c its immediate parent. Here’s a breakdown of the hierarchy:

Root node: a
- Child nodes: b, c
  - Child nodes of c: d, e

Therefore, node c is the parent of node e in this XML structure.
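
ElementTree elements do not keep a reference to their parent, so one common way to answer a "who is the parent" question in code is to build a child-to-parent map first. This is a sketch of that idiom; the name parent_of is an assumption for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b>X</b><c><d>Y</d><e>Z</e></c></a>')

# Map every child element to the element that directly contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}

e = root.find('c/e')
print(parent_of[e].tag)   # prints c
print(root in parent_of)  # prints False, because the root has no parent
```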

Looking at the following XML, what text value would we find at path “/a/c/e”
<a>
<b>X</b>
<c>
<d>Y</d>
<e>Z</e>
</c>
</a>

The text value you would find at path “/a/c/e” in the given XML is “Z”.

Here’s how the path navigates through the XML structure:

“/a” targets the root node named “a”.
“/a/c” selects the child node named “c” within node “a”.
“/a/c/e” pinpoints the child node named “e” within node “c”.

The text content of node “e” is “Z”, so that’s the value you would find at the specified path.
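
The same lookup can be sketched with ElementTree. One difference to keep in mind: find is evaluated relative to the element it is called on, and fromstring returns the root element a, so the path is written as c/e rather than /a/c/e.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b>X</b><c><d>Y</d><e>Z</e></c></a>')

print(root.tag)   # prints a, the root element

# Walk from a down to c, then to e, and read the text.
value = root.find('c/e').text
print(value)      # prints Z
```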

What is the purpose of XML Schema?

to establish a contract as to what is valid XML

The purpose of XML Schema is to establish a contract as to what is valid XML.

Here’s a more detailed explanation:

XML Schema (XSD) is a language for defining the structure and constraints of XML documents. It acts as a blueprint or rulebook that outlines what elements, attributes, and data types are allowed within a valid XML document.
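It enforces rules for data integrity and consistency. By validating against an XML Schema, you ensure that XML documents are valid, meaning they follow the required structure and data types, which makes them easier to process and exchange reliably. (Being well-formed, with properly nested and closed tags, is a separate and more basic requirement that applies to all XML.)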

Here’s how it works:

Schema Definition: An XML Schema is written in a specific syntax that defines the allowed structure of XML documents.
Validation: XML documents can be validated against an XML Schema to ensure they conform to the rules. This validation process catches errors and inconsistencies early on, preventing issues later in the data processing pipeline.
Shared Understanding: XML Schemas facilitate a common understanding of the structure and meaning of data among different systems or applications. This ensures compatibility and interoperability when exchanging XML data.

Key benefits of using XML Schema:

Improved Data Quality: Enforces data integrity and consistency.
Enhanced Data Sharing: Enables reliable exchange and interpretation of XML data across systems.
Better Maintainability: Provides clear documentation of XML structure, aiding in understanding and maintenance.
Stronger Type Checking: Ensures data adheres to specific data types, reducing errors.

For this XML Schema:

<xs:complexType name="person">
  <xs:sequence>
    <xs:element name="lastname" type="xs:string"/>
    <xs:element name="age" type="xs:integer"/>
    <xs:element name="dateborn" type="xs:date"/>
  </xs:sequence>
</xs:complexType>

And this XML,

<person>
   <lastname>Severance</lastname>
   <Age>17</Age>
   <dateborn>2001-04-17</dateborn>
</person>

Which tag is incorrect?

Age

The incorrect tag in the XML is Age.

Here’s why:

XML is case-sensitive: It distinguishes between uppercase and lowercase letters in element and attribute names.
The XML Schema specifies the element name as “age” in lowercase. However, the XML document uses “Age” with a capital “A”, violating the schema’s definition.

To correct this, the XML should be modified to use the lowercase “age” tag as specified in the schema:

XML

<person>
   <lastname>Severance</lastname>
   <age>17</age>
   <dateborn>2001-04-17</dateborn>
</person>

By making this change, the XML will conform to the rules defined in the XML Schema, ensuring its validity and consistency.
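
Case sensitivity also shows up when parsing. In this sketch, looking for the lowercase tag in the original (incorrect) XML finds nothing. The expected list is simply the element names copied from the schema above, since ElementTree itself does not read the schema.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('''
<person>
   <lastname>Severance</lastname>
   <Age>17</Age>
   <dateborn>2001-04-17</dateborn>
</person>''')

# Tag names are matched exactly, including upper and lower case.
print(doc.find('age'))        # prints None
print(doc.find('Age').text)   # prints 17

# Element names taken by hand from the schema.
expected = ['lastname', 'age', 'dateborn']
mismatched = [child.tag for child in doc if child.tag not in expected]
print(mismatched)             # prints ['Age']
```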

What is a good time zone to use when computers are exchanging data over APIs?

Universal Time (UTC) or Greenwich Mean Time (GMT).

The best time zone to use when computers are exchanging data over APIs is Universal Time (UTC) or Greenwich Mean Time (GMT).

Here’s why:

Global Standard: UTC is the primary time standard by which the world regulates clocks and time. It’s a single, consistent reference point for timekeeping, avoiding confusion and ambiguity when data is exchanged across different time zones.
Avoids Misinterpretations: Using local time zones can lead to inconsistencies and errors, as different systems might interpret timestamps in their own local context, potentially leading to incorrect calculations or misunderstandings.
Daylight Saving Time Issues: UTC also eliminates issues related to Daylight Saving Time (DST), which can further complicate time-related data if not handled properly. UTC stays consistent throughout the year, regardless of DST changes in different regions.
Clear and Unambiguous: Using UTC ensures that timestamps are interpreted the same way by all systems involved in the data exchange, regardless of their physical location or time zone settings.
Common Practice: UTC is the widely accepted standard for time-related data in APIs and communication protocols, making it the most compatible and reliable choice for exchanging data across systems.

Therefore, using UTC ensures clarity, consistency, and accuracy when handling time-related data in API communication, making it the preferred choice for ensuring seamless data exchange across different systems and time zones.
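As a rough illustration of the advice above, a program can convert a local timestamp to UTC before sending it. This is a minimal sketch using only the standard library; the five-hour offset and the variable names are assumptions chosen for the example.

```python
from datetime import datetime, timezone, timedelta

# A hypothetical local timestamp: 09:30:10 in a zone five hours behind UTC.
local_zone = timezone(timedelta(hours=-5))
local_time = datetime(2002, 5, 30, 9, 30, 10, tzinfo=local_zone)

# Convert to UTC before putting the value on the wire.
utc_time = local_time.astimezone(timezone.utc)
print(utc_time.isoformat())  # 2002-05-30T14:30:10+00:00

# The current moment, already expressed in UTC.
now_utc = datetime.now(timezone.utc)
print(now_utc.isoformat())
```

The two datetime objects describe the same instant; only the representation changes, so the receiver never needs to know the sender's time zone.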

Which of the following dates is in ISO8601 format?

2002-05-30T09:30:10Z

The date in ISO 8601 format is:

2002-05-30T09:30:10Z

Here’s why:

ISO 8601 defines a standard format for representing dates and times.
In this format, the year is always listed first with four digits (YYYY), followed by the month with two digits (MM), and then the day with two digits (DD).
The letter “T” separates the date from the time.
The time is represented with 24-hour notation, using two digits for hours (hh), two digits for minutes (mm), and two digits for seconds (ss).
Optionally, fractional seconds can follow the seconds after a decimal separator (for example, three digits for milliseconds).
The letter “Z” at the end signifies that the time is in Coordinated Universal Time (UTC).

Therefore, only 2002-05-30T09:30:10Z adheres to all the ISO 8601 requirements.

May 30, 2002 lacks the specific separators and format required by ISO 8601.
2002-May-30 uses month names instead of two-digit numbers, violating the standard format.
05/30/2002 uses American date format (MM/DD/YYYY) instead of the ISO 8601 format (YYYY-MM-DD).
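One way to check the four answer choices mechanically is to try parsing each one against a fixed pattern. This is only a sketch: the helper name parse_iso_utc and the pattern are assumptions for illustration, and the pattern accepts just the YYYY-MM-DDThh:mm:ssZ shape, not every variant that ISO 8601 permits.

```python
from datetime import datetime, timezone

# Pattern for the exact shape YYYY-MM-DDThh:mm:ssZ. The trailing Z is matched
# as a literal character; other ISO 8601 variants are not covered.
ISO_UTC = "%Y-%m-%dT%H:%M:%SZ"

def parse_iso_utc(text):
    """Return an aware UTC datetime, or None if text does not match."""
    try:
        return datetime.strptime(text, ISO_UTC).replace(tzinfo=timezone.utc)
    except ValueError:
        return None

for candidate in ["2002-05-30T09:30:10Z", "May 30, 2002",
                  "2002-May-30", "05/30/2002"]:
    print(candidate, "->", parse_iso_utc(candidate))
```

Only the first candidate produces a datetime; the other three return None because they do not follow the pattern.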

Graded App Item: Extracting Data from XML

Reading

Extracting Data from XML

In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geoxml.py. The program will prompt for a URL, read the XML data from that URL using urllib, then parse and extract the comment counts from the XML data and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.xml (Sum=2553)
Actual data: http://py4e-data.dr-chuck.net/comments_1511277.xml (Sum ends with 44)

You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment – so only use your own data url for analysis.

Data Format and Approach

The data consists of a number of names and comment counts in XML as follows:

<comment>
  <name>Matthias</name>
  <count>97</count>
</comment>

You are to look through all the <comment> tags, find the <count> values, and sum the numbers. The closest sample code that shows how to parse XML is geoxml.py. But since the nesting of the elements in our data is different from the data we are parsing in that sample code, you will have to make real changes to the code.

To make the code a little simpler, you can use an XPath selector string to look through the entire tree of XML for any tag named ‘count’ with the following line of code:

counts = tree.findall('.//count')

Take a look at the Python ElementTree documentation and look for the supported XPath syntax for details. You could also work from the top of the XML down to the comments node and then loop through the child nodes of the comments node.
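The two approaches described above can be compared on a small inline document before touching the network. This is a minimal sketch: the root element name commentinfo, the second comment, and the variable names are assumptions made for illustration, so check them against your own data file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. The root name <commentinfo> and the second
# comment are assumptions made for illustration.
sample = """<commentinfo>
  <comments>
    <comment><name>Matthias</name><count>97</count></comment>
    <comment><name>Ada</name><count>3</count></comment>
  </comments>
</commentinfo>"""

tree = ET.fromstring(sample)

# Approach 1: XPath-style search for every <count> anywhere in the tree.
counts = tree.findall(".//count")
total_xpath = sum(int(c.text) for c in counts)

# Approach 2: walk down to the <comments> node and loop over its children.
comments = tree.find("comments")
total_walk = sum(int(comment.find("count").text) for comment in comments)

print(len(counts), total_xpath, total_walk)  # 2 100 100
```

The XPath form is shorter and does not depend on the exact nesting, while the top-down walk makes the expected structure explicit and fails loudly if the structure changes.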

Sample Execution

$ python3 solution.py
Enter location: http://py4e-data.dr-chuck.net/comments_42.xml
Retrieving http://py4e-data.dr-chuck.net/comments_42.xml
Retrieved 4189 characters
Count: 50
Sum: 2...

Code

import urllib.request
import xml.etree.ElementTree as ET

# Prompt for the XML location, matching the sample execution
url = input('Enter location: ')
print('Retrieving', url)

# Read the whole response body (bytes) from the URL
uh = urllib.request.urlopen(url)
data = uh.read()
print('Retrieved', len(data), 'characters')

# Parse the XML and collect every <comment> under <comments>
tree = ET.fromstring(data)
lst = tree.findall('comments/comment')

total = 0
count = 0
for item in lst:
    count = count + 1
    # Counts are whole numbers, so convert with int() rather than float()
    total = total + int(item.find('count').text)

print('Count:', count)
print('Sum:', total)

Bonus Material

Video: Interview: Roy Fielding – Understanding the REST Architecture

Notes

Transcript

The speaker, Roy Fielding, describes the evolution of his “REST architectural style” concept, which served as a guiding principle for web application development and ultimately became the foundation of the HTTP standard. Here are the key points:

Early Web:

Informal development using mailing lists and rapid iteration.
Different implementations of web protocols led to a need for standardization.

Fielding’s Role:

Worked on URL, HTML, and HTTP standards.
Developed the “HTTP object model” as a mental model for web applications.
Used this model to inform HTTP standard decisions and advocate for good practices.

Formalizing the Concept:

Years later, discovered existing research on software architecture that resonated with his model.
Wrote his PhD dissertation, “Architectural Styles and the Design of Network-based Software Architectures,” formalizing the REST concept.

Impact:

REST principles became widely adopted and shaped the architecture of the modern web.
Fielding emphasizes the value of freedom and collaboration in achieving such impactful research.

Overall, the passage highlights the iterative and collaborative nature of internet development, where practical experience and theoretical understanding come together to shape foundational technologies like the web.

[MUSIC] The REST architectural style started
as a model of how the Web should work, particularly how web applications should work, in the sense that we had in the early 90s,
back in 1993, 94, we had a pretty good deployed system,
the World Wide Web. We had clients and
servers, user agents, browsers, whatever you want to call them. And simple servers primarily serving
up files and a few database systems. And we had this desire to, once there were more than a few
implementations of the Web protocols, we wanted to standardize those protocols
as part of the W3C and as part of the IETF. And in general just to resolve some of
the disagreements amongst the developers. At the time, most of the Web
was built informally using a mailing list, primarily,
as our coordination mechanism. We talked all around the world about a new
feature, and frequently we would come up with an idea in one time zone, and someone
would implement it in another time zone. And by the next morning,
you’d know what worked and what didn’t work with that feature. So it was very freeform, very fast. As the companies got involved,
they of course wanted to find ways to make use of the Web corporately to
make it as one of their platforms. And so they wanted to make it more
businessy, and one of the ways to make things more businessy is to
create common standards for everyone to adhere to rather than
adopt things as you go along. And as one of the developers
of a protocol library for the Web called libwww-perl,
and that’s the last time I used www in a product name
because it’s too hard to say, I was asked to help work on the standards. Both the URL standard at the time,
the HTML standard, and later on the HTTP standard. Because I was a graduate
student at UC Irvine, and I had all the freedom in the world, hadn’t
started working on my dissertation yet, and but I had finished all my class work. And that gave me both the freedom and
the ability to write for the Web in addition to the programming
that I was still doing. And it just worked out that being
in that position was great, in that I could have a hand at making the Web better because at the time it had
grown out in every direction at once. But at the same time I was
faced with the dilemma of, I have many competing interests working towards making the Web what
they think is a better place and how do I differentiate between
the ones that are actually better for the Web and the ones that are back to
some older version of an architecture or an architecture that just doesn’t
make any sense at all on the Internet. And so I came up with something
called the HTTP object model. At the time, object
models were the thing. So that’s why I called
it an object model, even though it had nothing
to do with objects. It was still a model of how I
expected Web applications to behave. And the team that was
working on the specification, mostly myself and
Henrik Frystyk Nielsen at the W3C, we were asked to
write the HTTP standard. And this was my model of describing
to each other, basically, how a particular change to the standard
would affect the resulting Web. Because the Web itself is
really a network of standards. And I used that throughout the years
just as basically a thought description. If someone would offer a feature or describe something that they
thought was wrong with the Web, I would use the model as sort of an analogy or a proof point to show what it is
about HTTP that works at that model, and what it is that the new
feature might hurt or might help. And that allowed me some
intellectual leverage, in many ways, to affect how the HTTP standard worked. It wasn’t until many years later after
I had done the literature search for software architecture, that I figured out
the right words to use to describe it. I saw a paper by Dewayne Perry and
Alex Wolf, in the software engineering,
one of the software engineering papers, buried in one of
the ACM SIGSOFT proceedings, which aren’t distributed very far beyond
your school library and things like that. But I found this paper, and
it was the only software architecture paper that described
architecture in terms of both the components and connectors
of typical architecture diagrams, but also the data that’s
processed through the system. And my realization is that all these
architecture papers which I had read, which didn’t make any sense to me because they were all talking about
the blueprints of an architecture. And this paper was talking about
the actual runtime architecture. The actual behavior of the system,
and that’s what I was building. >> It sounds like you were flowing between
a very practical and pragmatic world and this kind of theoretical world,
and just flowing gently back and forth for some period of time and
like picking up on both sides. >> Exactly. One of the great benefits I had
at UCI was all the freedom to pursue these different areas. I was actually working in a team doing research on global software
engineering environments. So I was trying to use
the Web as a platform for software engineering,
essentially what GitHub is today. That was my research project, and as part of that, I could do all
of this other work related to it. One of the nice things about general
research funding at the time. >> So at some point you had to
sort of, like, take a breath and finish your thesis. >> Yeah,
I came from an academic background, my father’s a professor of geography and
urban economics. And so
I always wanted to complete the PhD. It was never a question of running off and
joining a startup even though my startup friends were
becoming millionaires left and right. There was always that
desire to finish the PhD. >> Was it easy to write, did it come
naturally at that point would you, I mean, was the idea fully in
your mind at that point? >> Oh, yeah. The idea was not only fully in my mind but
almost past it at that point, because I finished HTTP,
finished the HTTP standard in 1997. And it wasn’t until I had done the work, that actually a colleague of mine,
Larry Masinter came and was talking to me about a related
subject and I was telling him about how, I’ve done all this work, I don’t
know what to do for my dissertation. And he just looked at me and said well,
you’re the only one who can describe HTTP, why it’s there, and what it’s there for.
Why don’t you just do that? And, so that gave me the impetus
to actually say, well, I can do this. I can describe what I already did,
I can actually describe it. But then, the question was, had I
been just fooling around, in that that wasn’t my academic work? My academic work was over here and
my practical work was on the Web and I hadn’t really mixed the two
other than general knowledge. And for me, going back and trying to find
the real knowledge framework for architectural styles was my way
of fitting it all together. >> What’s it like to be the creator of a
dissertation that someone actually reads? >> It’s funny because when I
was a graduate student, one of the main motivators that
the professors would have, they would very seriously look at us, was:
don’t worry about what you put in your dissertation. Nobody
is going to read it anyways. I might, because I’m on your committee,
but don’t worry about it because no one outside your committee is ever
going to read this thing. Just get it done and go on. It’s just not my style
to do that kind of writing. So I consider it my first book,
my only book really. And for me, writing is very difficult,
in the sense that I spend a lot of time thinking about
each sentence, each paragraph. I’m not the kind of person who
writes down a quick rough draft and then goes through and edits it again. I tend to edit sentence by paragraph and add a paragraph then delete two
paragraphs, and then go back. That kind of thing. So it’s gratifying that people
like to read a dissertation. Part of the, it’s certainly an
accessible piece of work. It’s not full of equations. There’s one equation. The equation is
there just to have an equation, by the way. It’s not actually necessary, but
it’s nice to have one. >> It’s rare that academic research
has a profound impact, and honestly the freedom you’ve had
is what everyone should have. >> Exactly, the freedom gave me
the ability to do technology transfer beyond their wildest imagination,
which is great. What’s hilarious from my standpoint
is I was just having fun. I was trying to do my good deed for
the universe kind of thing, and it was all for free, basically. But it was, for me, fun. Enjoyable people, wonderful conversations,
learned an incredible amount. [MUSIC]

Video: Bonus: Office Hours – Boston

Notes

Transcript

This video features Dr. Chuck Severance hosting an office hour session for his Python programming class at the Atlantic Beer Garden in Boston. Students introduce themselves and share their positive experiences with the class, highlighting its engaging nature and effectiveness in teaching Python. The video ends with cheers and goodbyes, as the next office hour is a few weeks away.

Here are some key points:

Dr. Severance uses an informal setting (a beer garden) to connect with students outside of the classroom.
Students express their appreciation for the class, mentioning its fun atmosphere and effectiveness in learning Python.
Some students share their specific reasons for taking the class, such as contributing to open-source projects or exploring new career opportunities.
The video emphasizes the supportive and enthusiastic learning environment fostered by Dr. Severance and his Python programming class.

Overall, the video provides a glimpse into an engaging and effective Python learning experience led by Dr. Chuck Severance.

I like how you’re taking video of us. >> [CROSSTALK] Look at me,
I’m taking video of me taking a video. >> Yes. >> Okay, hello everybody. Here we are in Boston at
the Atlantic Brew House, right? The Atlantic Beer Garden. We’ve had a great conversation about
Internet history, technology, security, and Python programming for everybody,
and so I’d like to introduce you to some of the students in the class
and have them say hi to you and whatever. You don’t have to use your whole name. Just say hi and use your first name. >> John here. >> John here.
Okay. >> Hi, I’m Kelly. >> Hi, I’m Shane. >> Hi, I’m Oscar. >> You can say something
to the class if you like. >> Hi, I’m Todd. >> Something interesting. >> Hey, I’m Sean, this is one of
the best classes I’ve taken so far. >> Hi, Emily. >> Hi, I’m Tenji, nice to meet you. >> My name is Wyatt Jackson
from Big Beat Data. This is the best. >> Still going. >> I’m Summit. I’m taking this class first time,
and you should too. It’s a fun class. >> Hi, I’m Alex. I’m only 13 years old, and
I really love this class. >> Hi, I’m Ahmed. I’m exploring Python, and
I am enjoying this course. >> I’m Joel. I’m learning Python to contribute
to a bunch of open-source tools I use every day. >> Hi, my name is Pondelli. I’m in software industry, and
I really enjoyed the class. Very effective, and
I love Dr. [INAUDIBLE]. >> Hi, I’m Grace. I took Python this past winter,
and I really enjoyed it. >> I’m Jon, and I’m just with her. >> [LAUGH]
>> Hi, I’m Fay. I want to learn more about Python and
I’m a big fan of Moose. >> Great.
So there we are. I think we’ll all say hi or
clap or whatever. >> [APPLAUSE]
Wave to the whole class, okay. There we go. I don’t know where we’re gonna be next. It’s gonna be a couple of weeks
before I have another office hour. So, cheers.

Video: Bonus Video: Ian Horrocks / RDF / OWL (Advanced)

Notes

Transcript

This passage describes the development of OWL (Web Ontology Language) and its impact on the field of knowledge representation (KR). Here are the key points:

Early Days:

The speaker, Ian Horrocks, worked on ontology languages and reasoning systems within medical informatics.
He joined forces with others in Europe to develop a description logic-based language called OIL.
This effort merged with the DAML program in the US, resulting in DAML+OIL.

Standardization and Evolution:

The goal was to standardize DAML+OIL as a web ontology language, leading to the creation of the OWL working group.
While initially conceived as a simple process, the inclusion of concerns from the Web community (integration with RDF, compatibility with existing standards) significantly extended and altered the development timeline.
The final OWL language retained the core logic of DAML+OIL but implemented changes in syntax and RDF integration.

Impact:

OWL’s standardization provided a crucial turning point for KR. Previously, numerous incompatible variants limited adoption.
With a standard language and growing infrastructure, researchers and developers embraced OWL, particularly in academia and other scientific disciplines.
The speaker emphasizes the unexpected emergence of an industry around OWL tools and infrastructure, fueled by the shared plumbing and enabling development of diverse applications.

Challenges and Success:

OWL’s success, the speaker argues, lies not in its perfection but in establishing a common ground and enabling practical application.
Despite initial disagreements and compromises, the community collectively moved forward and built upon the standardized language.
The speaker applauds the community for achieving this success, highlighting the widespread use of OWL and RDF tools in unexpected applications.
He encourages embracing this as a significant accomplishment and a testament to the collaborative efforts within the KR community.

Overall, the passage highlights the collaborative effort behind the development and success of OWL, emphasizing its role in standardizing and promoting the use of KR languages in diverse fields.

[MUSIC] My background had been working in
medical informatics and developing what we would now call ontology
languages and reasoning systems. Although actually to be honest
in the medical informatics area, we weren’t necessarily calling them
ontologies back then, we do now. And I went to a meeting
of a European Network, met people like Frank van Harmelen,
Dieter Fensel, who were also interested in
the beginnings of this area. And I managed to convince them that
this description logics area, which I’d been working in, which is a sort of
logic whose rationale is to formalize what we now call ontology languages, that
that would be a good starting point. It had more expressive power and a very clear formal semantics with
just first-order logic, basically. Just a fragment of first-order logic. And we went from there. And between us,
we developed a language called OIL, which was based on a description logic
which was already around at the time. We met people in the U.S.
like Jim Hendler, and the DAML program, people working on the DAML program.
We all decided that, hey, we’re more or less trying to do the same thing, why
don’t we pool our resources, which we did. Came up with DAML+OIL, wasn’t really
much different from the OIL thing. And then the idea was to go for,
to try to develop this into a standard so more people would
really be able to use it. And this was where OWL originally
came in, and then the OWL working group started and
we went through the process. What we thought would be the easy
process of standardizing DAML+OIL, as a Web ontology language. So then of course, a whole new
bunch of people joined the party, which were the sort of Web people. And of course, they had a whole
load of concerns of their own. Things that were important to them. Which were things like,
integration compatibility with RDF and generally with Web infrastructure and
existing standards. So actually the process then of changing
DAML+OIL, evolving DAML+OIL into OWL, it took longer than we thought,
involved a bigger change than we thought, and I think it took a couple
of years in the end. And much more than that off my life,
[LAUGH] ten years off my life, I think. And but, I mean, it was pretty interesting, and
I learned a lot there, as well. And the language evolved not a great deal, but the few, it was mainly the syntax and
the relationship with RDF that changed. The underlying logic
didn’t change very much. And of course the semantics didn’t change. Because that just flows from the logic. The huge impact of OWL
was just the fact that there had been these kinds of KR languages around for
donkey’s years, as you know. And but there’d been, you know,
every university research group had typically
created their own variant, their own little flavor,
all somewhat incompatible. And it had been conjectured that that
had interfered with the take up. But of course, you could never really tell
whether or not there would be significant take up if you had a standard
language that everybody was using until OWL came along,
when suddenly we had that thing. We had a standard KR language, that was kind of supported by lots of different
groups building infrastructure. And suddenly applications people started
to feel more comfortable about using that. It had always been an issue you know, if you were using the system from the
University of X, and then that research group just suddenly got bored with that
and went off and did something else. So you were left with no support,
whatever. Now with OWL you could
use a standard language, there were tons of people supporting it. Ever growing array of
infrastructure to support building, deploying,
maintaining ontologies, and that meant that people
then really started using it, I mean,
people in industry to some extent, but initially probably more other academic
disciplines, scientists, researchers. >> Did the whole industry sort of form,
that you wouldn’t have anticipated? >> Yeah. >> Once you got the plumbing kind of right? >> But I mean, I think the thing that struck me
initially about OWL was the fact that, once we just agreed we were
all gonna use 15-mil pipes, then we could have a huge industry
of people building all kinds of cool plumbing stuff that could
all fit together and do amazing stuff, that we would
never have anticipated in advance. And that was the thing about OWL, it’s not
that OWL was the perfect language, it’s not that 15 mil is the right size for
a pipe. It’s just that at some point,
you have to say, okay, you know, I would have gone for 18 mil,
you would have gone for 13 mil, we’ll compromise here,
we’ll fix on 15 mil, we’ll get on with it. And we’ll see what we can build with that. And this is what happened with OWL. There were huge arguments,
in the end we managed to reach a compromise that
everybody could sign up to. Some with more grumbling than others,
but everybody signed up to it. And then people started building stuff. And now the thing that amazes me today,
with both OWL and RDF actually, is that you just bump into people all the time
who are just building applications, using OWL and RDF tools and infrastructure,
that you never even knew about before. Because that stuff just works now. They just download it off the Web,
they build it into applications, they’re pretty happy. It may not be. If you look at what they
did, you may say well, you know, if I were doing that I’m not
even sure I would have really used RDF and OWL, but they downloaded the tools,
worked really nicely. They were super happy with it and
the toilet flushed in the end. Job done. >> Job done. >> And I think one of the things as a
community we haven’t been very good about is embracing that as a great success. And saying,
yeah we’ve done a really good job here. We’ve built stuff that people
are using and it works. It really kind of works
off the shelf these days. The tools and stuff are pretty robust. And we should be proud of that,
it’s a achievement of the community. [MUSIC]
