Version 1011 (Mon Nov 27 20:46:06 2006)
"<>"
<tagname>…</tagname>
<tagname/>
<X>…<Y>…</Y></X> is legal…
<X>…<Y>…</X></Y> is not
"<" and ">"
&name;
Sequence | Character |
---|---|
&lt; | < |
&gt; | > |
&quot; | " |
&amp; | & |
Table 19.1: XML Character Escapes |
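The escapes in Table 19.1 rarely need to be applied by hand. As a minimal sketch, assuming Python 3 and the standard library's xml.sax.saxutils module (neither is named in the text above), the substitution and its inverse might look like this. The sample string is invented for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Invented sample text containing all four characters from Table 19.1.
raw = 'if a < b & b > c: print("ok")'

# escape() replaces &, <, and > by default; the double quote has to be
# requested through the optional entities mapping.
escaped = escape(raw, {'"': '&quot;'})
print(escaped)

# unescape() reverses the substitution when given the inverse mapping.
restored = unescape(escaped, {'&quot;': '"'})
print(restored == raw)
```

The ampersand is replaced first when escaping and last when unescaping, which is why text that already contains an escape sequence survives the round trip.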
Tag | Usage |
---|---|
<html> | Root element of entire HTML document. |
<body> | Body of page (i.e., visible content). |
<h1> | Top-level heading. Use <h2>, <h3>, etc. for second- and third-level headings. |
<p> | Paragraph. |
<em> | Emphasized text; browser or editor will usually display it in italics. |
<address> | Address of document author (also usually displayed in italics). |
Table 19.2: Basic XHTML Tags |
<html>
  <body>
    <h1>Software Carpentry</h1>
    <p>This course will introduce <em>essential software development skills</em>,
    and show where and how they should be applied.</p>
    <address>Greg Wilson (gvwilson@third-bit.com)</address>
  </body>
</html>
Figure 19.1: Simple Page Rendered by Firefox
<h1/> (level-1 heading) is semantic (meaning)
<i/> (italics) is display (formatting)
<h1 align="center">A Centered Heading</h1>
<p id="disclaimer" align="center">This planet provided as-is.</p>
<p align="left" align="right">…</p> is illegal
<p align=center>…<p>, but modern parsers will reject it
<head/> element as well as a <body/>
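The illegal forms noted so far (mis-nested tags, a repeated attribute, an unquoted attribute value) can be checked mechanically. The following is a minimal sketch, assuming Python 3 and the standard library's xml.dom.minidom parser; the helper name is_well_formed and the sample strings are made up for this illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical samples modelled on the fragments shown above.
samples = {
    'properly nested': '<X><Y>text</Y></X>',
    'mis-nested': '<X><Y>text</X></Y>',
    'repeated attribute': '<p align="left" align="right">text</p>',
    'unquoted attribute': '<p align=center>text</p>',
}

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, '->', 'accepted' if ok else 'rejected')
```

Only the properly nested sample should be accepted; the parser stops at the first well-formedness error in each of the others.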
Comments start with <!--, and end with -->
<html>
  <head>
    <title>Comments Page</title>
    <meta name="author" content="aturing"/>
  </head>
  <body>
    <!-- House style puts all titles in italics -->
    <h1><em>Welcome to the Comments Page</em></h1>
    <!-- Update this paragraph to describe the forum. -->
    <p>Welcome to the Comments Forum.</p>
  </body>
</html>
<ul/> for an unordered (bulleted) list, and <ol/> for an ordered (numbered) one
<li/>
<table/> for tables
<tr/> (for “table row”)
<td/> (for “table data”)
<html>
  <head>
    <title>Lists and Tables</title>
    <meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
  </head>
  <body>
    <table cellpadding="3" border="1">
      <tr>
        <td align="center"><em>Unordered List</em></td>
        <td align="center"><em>Ordered List</em></td>
      </tr>
      <tr>
        <td align="left" valign="top">
          <ul>
            <li>Hydrogen</li>
            <li>Lithium</li>
            <li>Sodium</li>
            <li>Potassium</li>
            <li>Rubidium</li>
            <li>Cesium</li>
            <li>Francium</li>
          </ul>
        </td>
        <td align="left" valign="top">
          <ol>
            <li>Helium</li>
            <li>Neon</li>
            <li>Argon</li>
            <li>Krypton</li>
            <li>Xenon</li>
            <li>Radon</li>
          </ol>
        </td>
      </tr>
    </table>
  </body>
</html>
Figure 19.2: Lists and Tables
<meta/> elements in document head
<img/> tag
src attribute specifies where to find the image file
<html>
  <head>
    <title>Images</title>
    <meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
  </head>
  <body>
    <h1>Our Logo</h1>
    <img src="../../../img/sc_powered.jpg" alt="[Powered by Software Carpentry]"/>
  </body>
</html>
Figure 19.3: Images in Pages
alt attribute to specify alternative text
<a/> element to create a link
href attribute specifies what the link is pointing at
<html>
  <head>
    <title>Links</title>
    <meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
  </head>
  <body>
    <h1>A Few of My Favorite Places</h1>
    <ul>
      <li><a href="http://www.google.com">Google</a></li>
      <li><a href="http://www.python.org">Python</a></li>
      <li><a href="http://www.nature.com/index.html">Nature Online</a></li>
      <li>Examples in this lecture:
        <ul>
          <li><a href="comments.html">Comments</a></li>
          <li><a href="image.html">Images</a></li>
          <li><a href="list_table.html">Lists and Tables</a></li>
        </ul>
      </li>
    </ul>
  </body>
</html>
Figure 19.4: Links in Pages
minidom
<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>
Figure 19.5: A DOM Tree
ElementTree use dictionaries instead
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
import xml.dom.minidom

doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
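For comparison, the dictionary-based style attributed to ElementTree above might look roughly like the following. This is a sketch only: it assumes Python 3 and the xml.etree.ElementTree module from the standard library, which may not be the ElementTree distribution the text had in mind, and it parses the Mercury document from a string rather than from mercury.xml.

```python
import xml.etree.ElementTree as ET

# The Mercury document from above, held in a string for self-containment.
src = '''<planet name="Mercury">
<period units="days">87.97</period>
</planet>'''

planet = ET.fromstring(src)
period = planet.find('period')

# Each element exposes its attributes as an ordinary dictionary.
print(planet.attrib)
print(period.attrib['units'], float(period.text))
```

Here attribute lookup is plain dictionary indexing, where minidom uses method calls such as getAttribute.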
toxml method can be called on the document, or on any element node, to create text
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for the details
import xml.dom.minidom

my_xml = '''<name>Donald Knuth</name>'''
my_doc = xml.dom.minidom.parseString(my_xml)
name = my_doc.documentElement.firstChild.data
print 'name is:', name
print 'but name in full is:', repr(name)
name is: Donald Knuth
but name in full is: u'Donald Knuth'
u in front of the string the second time it is printed
print statement converts the Unicode string to ASCII for display
import xml.dom.minidom

src = '''<planet name="Venus">
<period units="days">224.7</period>
</planet>'''
doc = xml.dom.minidom.parseString(src)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Venus">
<period units="days">224.7</period>
</planet>
import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')
period = doc.createElement('period')
root.appendChild(period)
text = doc.createTextNode('686.98')
period.appendChild(text)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?> <planet name="Mars"><period>686.98</period></planet>
xml.dom.minidom is really just a wrapper around other platform-specific XML libraries
document node
createDocument specifies the type of the document's root node
createDocument are
setAttribute(attributeName, newValue)
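The calls named above can be exercised together in a short sketch. It is written for Python 3, unlike the Python 2 listings here, and the units attribute on the period element is added for illustration; the Mars listing above does not set it.

```python
import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()

# createDocument(namespaceURI, qualifiedName, doctype): no namespace and no
# DOCTYPE are used here, so only the root element's name is supplied.
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement

# setAttribute(attributeName, newValue) adds the attribute if it is missing
# and overwrites it if it is already present.
root.setAttribute('name', 'mars')
root.setAttribute('name', 'Mars')

period = doc.createElement('period')
period.setAttribute('units', 'days')   # illustrative addition
period.appendChild(doc.createTextNode('686.98'))
root.appendChild(period)

print(root.getAttribute('name'))
# toxml() works on an element node as well as on the whole document.
print(root.toxml())
```

Calling toxml on the root element rather than the document omits the XML declaration, which is convenient when only a fragment is wanted.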
<experimenter/> nodes, extract names, and print a sorted list
getElementsByTagName method to do this
import xml.dom.minidom

src = '''<heavenly_bodies>
<planet name="Mercury"/>
<planet name="Venus"/>
<planet name="Earth"/>
<moon name="Moon"/>
<planet name="Mars"/>
<moon name="Phobos"/>
<moon name="Deimos"/>
</heavenly_bodies>'''
doc = xml.dom.minidom.parseString(src)
for node in doc.getElementsByTagName('moon'):
    print node.getAttribute('name')
Moon
Phobos
Deimos
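The experimenter task mentioned above (find the nodes, extract the names, print a sorted list) can be sketched the same way. The input below is hypothetical: the element layout, the name attribute, and the names themselves are all invented here, since the real data file is not shown. The sketch is written for Python 3.

```python
import xml.dom.minidom

# Hypothetical input: the layout and the names are invented for this sketch.
src = '''<experiments>
<experiment><experimenter name="Knuth"/></experiment>
<experiment><experimenter name="Turing"/></experiment>
<experiment><experimenter name="Hopper"/></experiment>
<experiment><experimenter name="Turing"/></experiment>
</experiments>'''

doc = xml.dom.minidom.parseString(src)

# getElementsByTagName searches the whole subtree, not just direct children.
nodes = doc.getElementsByTagName('experimenter')

# A set drops repeated names before sorting.
names = sorted({node.getAttribute('name') for node in nodes})
for name in names:
    print(name)
```

If the names were stored as element text rather than in an attribute, the extraction step would read the data of each node's text child instead.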
nodeType
ELEMENT_NODE, TEXT_NODE, ATTRIBUTE_NODE, DOCUMENT_NODE
childNodes
data
import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''

def walkTree(currentNode, indent=0):
    spaces = ' ' * indent
    if currentNode.nodeType == currentNode.TEXT_NODE:
        print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
    else:
        print spaces + currentNode.tagName
        for child in currentNode.childNodes:
            walkTree(child, indent+1)

doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)
solarsystem
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (6)
 TEXT (1)
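The one-character TEXT nodes in that output are the line breaks between elements. One possible variation, sketched here for Python 3, skips text nodes that contain only whitespace and shows the content of the rest. The function name outline and the shortened two-planet input are choices made for this sketch, not part of the original walker.

```python
import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
</solarsystem>'''

def outline(node, depth=0, lines=None):
    """Collect an indented outline, ignoring whitespace-only text nodes."""
    if lines is None:
        lines = []
    if node.nodeType == node.TEXT_NODE:
        text = node.data.strip()
        if text:
            lines.append('  ' * depth + repr(text))
    elif node.nodeType == node.ELEMENT_NODE:
        lines.append('  ' * depth + node.tagName)
        for child in node.childNodes:
            outline(child, depth + 1, lines)
    return lines

doc = xml.dom.minidom.parseString(src)
for line in outline(doc.documentElement):
    print(line)
```

Testing nodeType against ELEMENT_NODE explicitly, rather than treating everything that is not text as an element, also keeps the walker from failing on node types such as comments, which have no tagName.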
<em/> element whose only child is a text node containing that word
Figure 19.6: Modifying the DOM Tree
<em/>
getElementsByTagName, and iterate over them
import re
import xml.dom.minidom

def emphasize(doc):
    paragraphs = doc.getElementsByTagName('p')
    for para in paragraphs:
        first = para.firstChild
        if first.nodeType == first.TEXT_NODE:
            emphasizeText(doc, para, first)
def emphasizeText(doc, para, textNode):

    # Look for optional spaces, a word, and the rest of the paragraph.
    m = re.match(r'^(\s*)(\S*)\b(.*)$', str(textNode.data))
    if not m:
        return
    leadingSpace, firstWord, restOfText = m.groups()
    if not firstWord:
        return

    # If there's text after the first word, re-save it.
    if restOfText:
        restOfText = doc.createTextNode(restOfText)
        para.insertBefore(restOfText, para.firstChild)

    # Emphasize the first word.
    emph = doc.createElement('em')
    emph.appendChild(doc.createTextNode(firstWord))
    para.insertBefore(emph, para.firstChild)

    # If there's leading space, re-save it.
    if leadingSpace:
        leadingSpace = doc.createTextNode(leadingSpace)
        para.insertBefore(leadingSpace, para.firstChild)

    # Get rid of the original text.
    para.removeChild(textNode)
if __name__ == '__main__':
    src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''
    doc = xml.dom.minidom.parseString(src)
    emphasize(doc)
    print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>
Copyright © 2005-06 Python Software Foundation.