XML

Version 1011 (Mon Nov 27 20:46:06 2006)

XML is becoming the standard way to store everything from web pages to astronomical data
- Bewildering variety of tools for dealing with it
- And more appearing every day
This lecture describes how to process and modify XML
- Warning: the standards are more complex than they should have been
Reading:
- [Castro 2002] if all you care about is HTML
- [Castro 2000] if you want to know more about XML
- [Harold 2004] if you want to become an expert

You Can Skip This Lecture If...

You know what the difference is between HTML and XML
You know what elements, attributes, and entities are
You know how to create a hyperlink in an HTML page
You know what DOM is
You know how to search an XML document

In the Beginning

1969-1986: Standard Generalized Markup Language (SGML)
- Developed by Charles Goldfarb and others at IBM
- A way of adding information to medical and legal documents so that computers could process them
- Very complex specification (over 500 pages)
1989: Tim Berners-Lee creates HyperText Markup Language (HTML) for the World Wide Web
- Much (much) simpler than SGML
- Anyone could write it, so everyone did

The Modern Era

Problem: HTML had a small, fixed set of tags
- Everyone wanted to add new ones
- Solution: create a standard way to define a set of tags, and the relationships between them
First version of XML standardized in 1998
- A set of rules for defining markup languages
- Much more complex than HTML, but still simpler than SGML
New version of HTML called XHTML was also defined
- Like HTML, but obeys all XML rules
- Still a lot of non-XML compliant HTML out there

Formatting Rules

A basic XML document contains elements and text
- Full spec allows for external entity references, processing instructions, and other fun
Elements are shown using tags
- Must be enclosed in angle brackets "<>"
- Full form: <tagname>…</tagname>
- Short form (if the element doesn't contain anything): <tagname/>

Document Structure

Elements must be properly nested
- If Y starts inside X, Y must end before X ends
- So <X>…<Y>…</Y></X> is legal…
- …but <X>…<Y>…</X></Y> is not
Every document must have a single root element
- I.e., a single element must enclose everything else
Specific XML dialects may restrict which elements can appear inside which others
- XHTML is very liberal
- MathML (Mathematical Markup Language) is stricter
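
The nesting and single-root rules can be checked mechanically by handing text to a parser. The sketch below assumes Python 3 and the standard library's xml.dom.minidom module; the helper name is_well_formed is made up for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise (hypothetical helper)."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False

print(is_well_formed('<X>a<Y>b</Y></X>'))   # properly nested
print(is_well_formed('<X>a<Y>b</X></Y>'))   # Y ends after X ends
print(is_well_formed('<X/><Y/>'))           # two root elements
```

Only the first call should report a well-formed document; the other two break the nesting rule and the single-root rule respectively.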

Text

Text is normal printable text
Must use escape sequences to represent special characters such as "<" and ">"
- In XML, written &name;

Sequence	Character
`&lt;`	`<`
`&gt;`	`>`
`&quot;`	`"`
`&amp;`	`&`
Table 19.1: XML Character Escapes
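
Escaping by hand is error-prone, so it is usually left to a library. A minimal sketch, assuming Python 3 and the standard library's xml.sax.saxutils module (the sample string is invented for illustration):

```python
from xml.sax.saxutils import escape, unescape

raw = 'if a < b & b > c: print("ok")'

# escape() always handles &, < and >; other characters must be requested.
escaped = escape(raw, {'"': '&quot;'})
print(escaped)

# unescape() reverses the process, given the extra mapping inverted.
restored = unescape(escaped, {'&quot;': '"'})
print(restored == raw)
```

Note that the ampersand has to be escaped before the other characters, otherwise the "&" in "&lt;" would itself be escaped; the library takes care of that ordering.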

XHTML

Most common use of XML is still XHTML (the XML version of hypertext)

Basic tags:

Tag	Usage
`<html>`	Root element of entire HTML document.
`<body>`	Body of page (i.e., visible content).
`<h1>`	Top-level heading. Use `<h2>`, `<h3>`, etc. for second- and third-level headings.
`<p>`	Paragraph.
`<em>`	Emphasized text; browser or editor will usually display it in italics.
`<address>`	Address of document author (also usually displayed in italics).
Table 19.2: Basic XHTML Tags


Sample XHTML Page

<html>
<body>
<h1>Software Carpentry</h1>

<p>This course will introduce <em>essential software development skills</em>,
and show where and how they should be applied.</p>

<address>Greg Wilson (gvwilson@third-bit.com)</address>
</body>
</html>

Figure 19.1: Simple Page Rendered by Firefox


Critique of HTML/XHTML

HTML and XHTML mix semantics and display
- <h1/> (level-1 heading) is semantic (meaning)
- <i/> (italics) is display (formatting)
Now generally considered a bad thing
- Modern HTML/XHTML documents contain semantic tags only
- Control display using Cascading Style Sheets (CSS)
  - Out of the scope of this course

Attributes

Elements can be customized by giving them attributes
- Enclosed in the opening tag
- <h1 align="center">A Centered Heading</h1>
- <p id="disclaimer" align="center">This planet provided as-is.</p>
An attribute name may appear at most once in any element
- Like keys in a dictionary
- So <p align="left" align="right">…</p> is illegal
Values must be quoted
- Old-style browsers accepted <p align=center>…</p>, but modern parsers will reject it
- Must use escape sequences for angle brackets, quotes, etc. inside values
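
These rules can be seen in action with a parser. The sketch below assumes Python 3 and the standard library's xml.dom.minidom module; the helper name rejected is hypothetical.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

doc = xml.dom.minidom.parseString(
    '<p id="disclaimer" align="center">This planet provided as-is.</p>')
para = doc.documentElement

# Attribute values always come back as strings.
print(para.getAttribute('id'), para.getAttribute('align'))

# A missing attribute yields an empty string rather than an error.
print(repr(para.getAttribute('class')))

def rejected(text):
    """Return True if the parser refuses text (hypothetical helper)."""
    try:
        xml.dom.minidom.parseString(text)
    except ExpatError:
        return True
    return False

print(rejected('<p align="left" align="right">text</p>'))  # duplicate name
print(rejected('<p align=center>text</p>'))                # unquoted value
```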

Attributes Vs. Elements

Use attributes when:
- Each value can occur at most once for any element
- The order of the values doesn't matter
- Those values have no internal structure
In all other cases, use nested elements
- If you have to parse an attribute's value to figure out what it means, use an element instead
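
To illustrate the last point, here is a sketch (Python 3, standard library's xml.dom.minidom) that stores a planet's moons both ways. The moons attribute and the comma-separated format are invented for this example.

```python
import xml.dom.minidom

# Attribute version: the moons are packed into one string that must be parsed.
packed = xml.dom.minidom.parseString(
    '<planet name="Mars" moons="Phobos,Deimos"/>')
moons_from_attr = packed.documentElement.getAttribute('moons').split(',')

# Element version: the structure is explicit, so no extra parsing is needed.
nested = xml.dom.minidom.parseString(
    '<planet name="Mars"><moon>Phobos</moon><moon>Deimos</moon></planet>')
moons_from_elems = [m.firstChild.data
                    for m in nested.getElementsByTagName('moon')]

print(moons_from_attr)
print(moons_from_elems)
```

Both versions yield the same list, but the first only works if every reader of the file knows that commas separate the names.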

More XHTML Tags

Well-written HTML pages have a <head/> element as well as a <body/>
- Contains metadata about the page

Well-written pages also use comments (just like code)

Introduce with <!--, and end with -->

<html>
<head>
  <title>Comments Page</title>
  <meta name="author" content="aturing"/>
</head>
<body>

<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>

<!-- Update this paragraph to describe the forum. -->
<p>Welcome to the Comments Forum.</p>

</body>
</html>

Unfortunately, comments cannot be nested


Lists and Tables

Use <ul/> for an unordered (bulleted) list, and <ol/> for an ordered (numbered) one
- Each list item is wrapped in <li/>
Use <table/> for tables
- Each row is wrapped in <tr/> (for “table row”)
- Within each row, column items are wrapped in <td/> (for “table data”)
- Note: tables are often used to force multi-column layout, as well as for tabular data

Example

<html>
<head>
  <title>Lists and Tables</title>
  <meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
</head>
<body>

<table cellpadding="3" border="1">
  <tr>
    <td align="center"><em>Unordered List</em></td>
    <td align="center"><em>Ordered List</em></td>
  </tr>
  <tr>
    <td align="left" valign="top">
      <ul>
        <li>Hydrogen</li>
        <li>Lithium</li>
        <li>Sodium</li>
        <li>Potassium</li>
        <li>Rubidium</li>
        <li>Cesium</li>
        <li>Francium</li>
      </ul>
    </td>
    <td align="left" valign="top">
      <ol>
        <li>Helium</li>
        <li>Neon</li>
        <li>Argon</li>
        <li>Krypton</li>
        <li>Xenon</li>
        <li>Radon</li>
      </ol>
    </td>
  </tr>
</table>

</body>
</html>

Figure 19.2: Lists and Tables

Note how Subversion keywords have been put in <meta/> elements in the document head
- Automatically updated each time the document is committed to version control

Images

How to put an image in a page?
- XML documents can only contain text, so you can't store an image or audio clip directly in a page
Usual solution is to store a reference to the external file using the <img/> tag
- The src attribute specifies where to find the image file

<html>
<head>
  <title>Images</title>
  <meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
</head>
<body>

<h1>Our Logo</h1>

<img src="../../../img/sc_powered.jpg" alt="[Powered by Software Carpentry]"/>

</body>
</html>

Figure 19.3: Images in Pages

Always use the alt attribute to specify alternative text
- Screen readers for people with visual handicaps use this instead of the image
- And it's good documentation for search engines
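
Missing alt attributes are easy to find automatically. A minimal sketch, assuming Python 3 and the standard library's xml.dom.minidom module; the page content and file names are made up.

```python
import xml.dom.minidom

page = '''<html><body>
<img src="logo.jpg" alt="[Logo]"/>
<img src="chart.png"/>
</body></html>'''

doc = xml.dom.minidom.parseString(page)

# Collect the src of every image that has no alt attribute.
missing = [img.getAttribute('src')
           for img in doc.getElementsByTagName('img')
           if not img.hasAttribute('alt')]
print(missing)
```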

Links

Links to other pages are what make it “hypertext”
Use the <a/> element to create a link
- The text inside the element is displayed and (usually) underlined for clicking
- The href attribute specifies what the link is pointing at
- Both local filenames and URLs are supported

<html>
<head>
  <title>Links</title>
  <meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
</head>
<body>

<h1>A Few of My Favorite Places</h1>

<ul>
  <li><a href="http://www.google.com">Google</a></li>
  <li><a href="http://www.python.org">Python</a></li>
  <li><a href="http://www.nature.com/index.html">Nature Online</a></li>
  <li>Examples in this lecture:
    <ul>
      <li><a href="comments.html">Comments</a></li>
      <li><a href="image.html">Images</a></li>
      <li><a href="list_table.html">Lists and Tables</a></li>
    </ul>
  </li>
</ul>

</body>
</html>

Figure 19.4: Links in Pages


The Document Object Model

The Document Object Model (DOM) is a cross-language standard for representing XML documents as trees
- One node for each element, attribute, or text
Pro:
- Much easier to manipulate trees than strings
- Same basic model in many different languages (which lowers the learning cost)
Con:
- Needs a lot of memory for large documents
- Generic standard doesn't take advantage of the more advanced features of some languages
Python's standard library includes a simple implementation of DOM called minidom
- Fast, sturdy, and well documented…
- …if you understand all the terminology, and know more or less what you're looking for

The Basics

Every DOM tree has a single root representing the document as a whole
- Doesn't correspond to anything that's actually in the document
This node has a single child, which is the root element of the document
It, and other element nodes, may have three types of children:
- Other elements
- Text nodes
- Attribute nodes (which minidom reaches through the element's attributes rather than through childNodes)

DOM Tree Example

<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>

Figure 19.5: A DOM Tree
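
The same tree can be explored from code. The sketch below assumes Python 3 and the standard library's xml.dom.minidom module, and parses the document shown above.

```python
import xml.dom.minidom

src = '''<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>'''

doc = xml.dom.minidom.parseString(src)

# The document node sits above everything that appears in the file.
print(doc.nodeType == doc.DOCUMENT_NODE)

# Its single element child is the root element of the document.
root = doc.documentElement
print(root.tagName)

# Element children of the root (whitespace between tags becomes text nodes,
# so they have to be filtered out).
names = [child.tagName for child in root.childNodes
         if child.nodeType == child.ELEMENT_NODE]
print(names)

# Attributes are reached through the element that owns them.
second = root.getElementsByTagName('second')[0]
print(second.getAttribute('attr'))
```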


Creating a Tree

Usual way to create a DOM tree is to parse a file

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
  <period units="days">87.97</period>
</planet>

import xml.dom.minidom
doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
  <period units="days">87.97</period>
</planet>


Converting to Text

The toxml method can be called on the document, or on any element node, to create text
DOM trees always store text as Unicode, so when you're converting the tree to text, you must tell the library how to represent characters
- Example above uses UTF-8, which is the best default choice
- See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for the details

This means that strings taken from XML documents are Unicode, not ASCII

import xml.dom.minidom

my_xml = '''<name>Donald Knuth</name>'''
my_doc = xml.dom.minidom.parseString(my_xml)
name = my_doc.documentElement.firstChild.data
print 'name is:', name
print 'but name in full is:', repr(name)

name is: Donald Knuth
but name in full is: u'Donald Knuth'

Note the u in front of the string the second time it is printed
- A simple print statement encodes the Unicode string for display, so the u prefix only appears in the repr
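
The difference between text and encoded bytes is easier to see with a non-ASCII character. The sketch below assumes Python 3, where the two are distinct types (str and bytes) rather than the unicode and str of the Python 2 example above; the sample name is chosen only because it contains such a character.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<name>Paul Erdős</name>')

# With no encoding, toxml() returns text (a Python 3 str).
as_text = doc.documentElement.toxml()
print(as_text)

# With an encoding, toxml() returns bytes in that encoding.
as_bytes = doc.toxml('utf-8')
print(type(as_bytes).__name__)
print('Erdős'.encode('utf-8') in as_bytes)
```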


The Details

xml.dom.minidom is really just a wrapper around other platform-specific XML libraries
- Have to reach inside it and get the underlying implementation object to create the document node
- That node then knows how to create other elements in the document
- Middle argument to createDocument specifies the type of the document's root node
- Documentation explains what the first and third arguments to createDocument are
Add new nodes to existing ones by:
- Asking the document to create the node
- Appending it to a node that's already part of the tree
Set attributes of element nodes using setAttribute(attributeName, newValue)
- Remember, all attribute values are strings
- If you want to store an integer or a Boolean, you have to convert it yourself
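
The steps above can be sketched as follows, assuming Python 3 and reusing the Mercury data from earlier in this lecture:

```python
import xml.dom.minidom

# Reach inside minidom for the underlying implementation object.
impl = xml.dom.minidom.getDOMImplementation()

# Arguments: namespace URI, name of the root element, document type.
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mercury')

# Ask the document to create nodes, then append them to the tree.
period = doc.createElement('period')
period.setAttribute('units', 'days')
period.appendChild(doc.createTextNode(str(87.97)))  # convert numbers yourself
root.appendChild(period)

print(root.toxml())
```

Passing None for the first and third arguments means "no namespace" and "no document type", which is enough for simple documents like this one.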

Finding Nodes

Often want to do things to all elements of a particular type
- E.g., find all <experimenter/> nodes, extract names, and print a sorted list
Use the getElementsByTagName method to do this
- Returns a list of all the descendants of a node with the specified tag

import xml.dom.minidom

src = '''<heavenly_bodies>
  <planet name="Mercury"/>
  <planet name="Venus"/>
  <planet name="Earth"/>
  <moon name="Moon"/>
  <planet name="Mars"/>
  <moon name="Phobos"/>
  <moon name="Deimos"/>
</heavenly_bodies>'''

doc = xml.dom.minidom.parseString(src)
for node in doc.getElementsByTagName('moon'):
    print node.getAttribute('name')

Moon
Phobos
Deimos

Question: what happens if you add or delete nodes while looping over this list?

Walking a Tree

Often want to visit each node in the tree
- E.g., print an outline of the document showing element nesting
Node's type is stored in a member variable called nodeType
- ELEMENT_NODE, TEXT_NODE, ATTRIBUTE_NODE, DOCUMENT_NODE
If a node is an element, its children are stored in a read-only list called childNodes
If a node is a text node, the actual text is in the member data

Recursive Tree Walker

import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''

def walkTree(currentNode, indent=0):
    spaces = ' ' * indent
    if currentNode.nodeType == currentNode.TEXT_NODE:
        print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
    else:
        print spaces + currentNode.tagName
        for child in currentNode.childNodes:
            walkTree(child, indent+1)

doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)

solarsystem
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (6)
 TEXT (1)

Traversing a tree like this is just one of many recurring patterns in object-oriented programming
- We'll discuss them briefly in Backward, Forward, and Sideways

Modifying the Tree

Modifying trees in place is a little bit tricky
- Helps to draw lots of pictures
Example: want to emphasize the first word of each paragraph
- Get the text node below the paragraph
- Take off the first word
- Insert a new <em/> element whose only child is a text node containing that word

Figure 19.6: Modifying the DOM Tree

Complications

But what if the first child of the paragraph already has some markup around it?
- E.g., what if the paragraph starts with a link?
Could just wrap the first child with <em/>
- But if (for example) the link contains several words, this will look wrong
We'll ignore this problem for now

Solution

Step 1: find all the paragraphs using getElementsByTagName, and iterate over them

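import re
import xml.dom.minidom

def emphasize(doc):
    paragraphs = doc.getElementsByTagName('p')
    for para in paragraphs:
        first = para.firstChild
        # Skip empty paragraphs, and paragraphs that start with markup.
        if first is not None and first.nodeType == first.TEXT_NODE:
            emphasizeText(doc, para, first)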

Step 2: break the paragraph text into pieces, and handle each piece in turn

Create a new node for each piece
Push it onto the front of the paragraph's child list
Once they've all been handled, get rid of the original text node

def emphasizeText(doc, para, textNode):

    # Look for optional spaces, a word, and the rest of the paragraph.
    # (re.DOTALL lets the pattern match text that contains newlines.)
    m = re.match(r'^(\s*)(\S*)\b(.*)$', textNode.data, re.DOTALL)
    if not m:
        return
    leadingSpace, firstWord, restOfText = m.groups()
    if not firstWord:
        return

    # If there's text after the first word, re-save it.
    if restOfText:
        restOfText = doc.createTextNode(restOfText)
        para.insertBefore(restOfText, para.firstChild)

    # Emphasize the first word.
    emph = doc.createElement('em')
    emph.appendChild(doc.createTextNode(firstWord))
    para.insertBefore(emph, para.firstChild)

    # If there's leading space, re-save it.
    if leadingSpace:
        leadingSpace = doc.createTextNode(leadingSpace)
        para.insertBefore(leadingSpace, para.firstChild)

    # Get rid of the original text.
    para.removeChild(textNode)


Not Finished Yet

Step 3: test it

Yes, it really is part of the program

if __name__ == '__main__':

    src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''

    doc = xml.dom.minidom.parseString(src)
    emphasize(doc)
    print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>


Summary

There's a lot of hype in hypertext
- Haven't yet heard anyone claim that XML will cure the common cold, but I'm sure it's been said
Pros:
- One set of rules for people to learn
- Never have to write a parser again
  - At least, the low-level syntactic bits—still need to figure out what all those tags mean
Cons:
- Raw XML is hard to read
  - Particularly if it has been generated by a machine
- A lot of data isn't actually trees
  - When storing a 2D matrix or a table, you have to organize data by row or by column…
  - …either of which makes the other hard to access
- There are a lot of complications and subtleties
  - Most applications ignore most of them
  - Which means that they fail (usually badly) when confronted with something outside the subset they understand
Like Inglish speling, it's here to stay
