CS145 Lecture Notes (5) -- XML Programming: XPath, SAX, DOM

Consider a database containing data encoded in XML. Don't worry about documents, namespaces, etc.
XML query: issued over database of XML, usually returns XML
Evolution of query languages for XML:
1. The database research road:
  1. Lore - invented here!
  2. XML-QL - AT&T Research
  3. Quilt - IBM Almaden Research
  4. XQuery - W3C standard
2. The standards road:
  1. XPath - path expressions and conditions
  2. XSLT - XPath plus transformations and output formatting
  3. XQuery - XPath + full query language
  XPath is also used in XLink and XPointer

XML DTD and sample data for examples

   <!ELEMENT Bookstore (Book | Magazine)*>
   <!ELEMENT Book (Title, Authors, Remark?)>
   <!ATTLIST Book ISBN CDATA #REQUIRED
             Price CDATA #REQUIRED
             Edition CDATA #IMPLIED>
   <!ELEMENT Magazine (Title)>
   <!ATTLIST Magazine Month CDATA #REQUIRED Year CDATA #REQUIRED> 
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Authors (Author+)>
   <!ELEMENT Remark (#PCDATA)>
   <!ELEMENT Author (First_Name, Last_Name)>
   <!ELEMENT First_Name (#PCDATA)>
   <!ELEMENT Last_Name (#PCDATA)>

   <?xml version="1.0" standalone="no"?>
   <!DOCTYPE Bookstore SYSTEM "bookstore.dtd">
   <Bookstore>
      <Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd">
         <Title>A First Course in Database Systems</Title>
         <Authors>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
      </Book>
      <Book ISBN="ISBN-0-13-031995-3" Price="$75">
         <Title>Database Systems: The Complete Book</Title>
         <Authors>
            <Author>
               <First_Name>Hector</First_Name>
               <Last_Name>Garcia-Molina</Last_Name>
            </Author>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
         <Remark>
         Amazon.com says: Buy this book bundled with "A First Course,"
         it's a great deal!
         </Remark>
      </Book>
   </Bookstore>

XPath

Think of XML as a tree (or directory) structure.

XPath specifies path expressions that match XML data by navigating down (and occasionally up or across) the tree.

Basic constructs (very incomplete list):

/ root element, or separator between steps in path
* matches any one element name
@X matches attribute X of the current element
// matches any descendant of the current element
[C] evaluates condition C on the current element
[N] picks the Nth matching element
contains(s1,s2) returns TRUE if string s1 contains string s2
name() returns tag of the current element
parent:: matches the parent of the current element
following-sibling:: matches all siblings after the current node
descendant:: matches any descendant of the current element
self:: matches the current element

Important:

XPath is standardized and is fairly stable.
XPath is a full tree-matching expression language; an XPath expression evaluates to a set of nodes from an XML document.
XPath has some syntactic shortcuts that make commonly used expressions very terse and natural.
XPath is designed to be forgiving of unexpected variation in XML structure.
We will cover the most commonly used subset.
See the readings for (many!) more details.

(Example: all book titles)

(Example: all book or magazine titles)

(Example: all ISBN numbers)
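
The first three examples can be sketched in Java with the standard javax.xml.xpath API. This is a minimal sketch, not the lecture's own solution: the class name XPathBasics, the eval helper, and the Magazine entry in the inlined data are assumptions added for illustration (the sample data above has no Magazine), Authors are omitted for brevity, and the DOCTYPE line is dropped so that no bookstore.dtd file is needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBasics {
    // Trimmed copy of the sample data; the Magazine is a hypothetical addition.
    static final String XML =
        "<Bookstore>"
      + "<Book ISBN='ISBN-0-13-035300-0' Price='$65' Edition='2nd'>"
      + "<Title>A First Course in Database Systems</Title></Book>"
      + "<Book ISBN='ISBN-0-13-031995-3' Price='$75'>"
      + "<Title>Database Systems: The Complete Book</Title></Book>"
      + "<Magazine Month='January' Year='2004'>"
      + "<Title>Hypothetical Monthly</Title></Magazine>"
      + "</Bookstore>";

    // Evaluates expr against XML and returns the string value of each matched node.
    static List<String> eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("/Bookstore/Book/Title"));  // all book titles
        System.out.println(eval("/Bookstore/*/Title"));     // all book or magazine titles
        System.out.println(eval("/Bookstore/Book/@ISBN"));  // all ISBN numbers
    }
}
```

Note that * stands in for exactly one element name, so /Bookstore/*/Title reaches Title elements under either Book or Magazine but not titles nested more deeply.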

(Example: all books costing < $70)

(Example: all ISBN numbers of books costing < $70)

(Example: all books containing a remark)

(Example: all titles of books costing < $70 where "Ullman" is an author)

(Example: same query using //)

(Example: all second authors anywhere)

(Example: all author last names anywhere)
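
One possible set of answers for the examples from "books costing < $70" through "all author last names" is sketched below. The Price attribute in the sample data is a string with a leading "$", so this sketch strips it with substring(@Price, 2) before comparing; that treatment, the class name XPathConditions, the eval helper, and the shortened Remark text are assumptions for illustration. Queries return an attribute or Title rather than the whole Book so the output stays readable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathConditions {
    // Builds one Author element (helper used only to keep the inlined data short).
    static String author(String first, String last) {
        return "<Author><First_Name>" + first + "</First_Name>"
             + "<Last_Name>" + last + "</Last_Name></Author>";
    }

    // The two sample books, inlined without the DOCTYPE; the Remark is shortened.
    static final String XML =
        "<Bookstore>"
      + "<Book ISBN='ISBN-0-13-035300-0' Price='$65' Edition='2nd'>"
      + "<Title>A First Course in Database Systems</Title><Authors>"
      + author("Jeffrey", "Ullman") + author("Jennifer", "Widom")
      + "</Authors></Book>"
      + "<Book ISBN='ISBN-0-13-031995-3' Price='$75'>"
      + "<Title>Database Systems: The Complete Book</Title><Authors>"
      + author("Hector", "Garcia-Molina") + author("Jeffrey", "Ullman")
      + author("Jennifer", "Widom")
      + "</Authors><Remark>Buy this book bundled with A First Course</Remark></Book>"
      + "</Bookstore>";

    // Condition "costs less than $70": drop the leading '$', then compare as a number.
    static final String CHEAP = "number(substring(@Price, 2)) < 70";

    // Evaluates expr against XML and returns the string value of each matched node.
    static List<String> eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // ISBNs of books costing < $70
        System.out.println(eval("/Bookstore/Book[" + CHEAP + "]/@ISBN"));
        // titles of books containing a remark
        System.out.println(eval("/Bookstore/Book[Remark]/Title"));
        // titles of books costing < $70 where "Ullman" is an author
        System.out.println(eval("/Bookstore/Book[" + CHEAP
            + " and Authors/Author/Last_Name = 'Ullman']/Title"));
        // same query using //
        System.out.println(eval("//Book[" + CHEAP + " and .//Last_Name = 'Ullman']/Title"));
        // last names of all second authors, then all author last names
        System.out.println(eval("//Author[2]/Last_Name"));
        System.out.println(eval("//Last_Name"));
    }
}
```

The [2] in //Author[2] counts positions among the Author children of each parent separately, so it picks one second author per book rather than the second Author in the whole document.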

(Example: all books whose title contains one of its author's last names)

(Example: all magazines where there is a book of the same title)

(Example: all books where there is a different book of the same title)
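
The three "join"-style examples above compare values found in different parts of the tree. The sketch below shows one way to write them in XPath 1.0. The sample data has no such matches, so the data here is entirely hypothetical (made-up titles, short ISBNs, Authors reduced to one per book), as are the class name XPathJoins and the eval helper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathJoins {
    // Builds one hypothetical Book with a single author.
    static String book(String isbn, String title, String lastName) {
        return "<Book ISBN='" + isbn + "' Price='$10'><Title>" + title + "</Title>"
             + "<Authors><Author><First_Name>X</First_Name><Last_Name>" + lastName
             + "</Last_Name></Author></Authors></Book>";
    }

    // Hypothetical data chosen so that every query has at least one match.
    static final String XML =
        "<Bookstore>"
      + book("1", "Ullman on Databases", "Ullman")
      + book("2", "Data Weekly", "Widom")
      + book("3", "Data Weekly", "Garcia-Molina")
      + "<Magazine Month='May' Year='2004'><Title>Data Weekly</Title></Magazine>"
      + "<Magazine Month='June' Year='2004'><Title>Other News</Title></Magazine>"
      + "</Bookstore>";

    // Evaluates expr against XML and returns the string value of each matched node.
    static List<String> eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // books whose title contains one of its authors' last names (prints ISBNs)
        System.out.println(eval(
            "//Book[Authors/Author/Last_Name[contains(ancestor::Book/Title, .)]]/@ISBN"));
        // magazines where there is a book of the same title (prints months)
        System.out.println(eval(
            "/Bookstore/Magazine[Title = /Bookstore/Book/Title]/@Month"));
        // books where there is a different book of the same title (prints ISBNs)
        System.out.println(eval(
            "/Bookstore/Book[Title = following-sibling::Book/Title"
            + " or Title = preceding-sibling::Book/Title]/@ISBN"));
    }
}
```

In XPath 1.0, comparing two node sets with = is true if any pair of nodes has equal string values, which is what makes the magazine query work without an explicit loop. The sibling axes in the last query are what keep a book from matching itself.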

(Example: all elements whose parent tag is not "Book")

For the next examples, modify the DTD to contain Remark* instead of Remark?

(Example: all books where a Remark includes "great")

(Example: all books where all Remarks include "great")
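
The difference between "a Remark includes" and "all Remarks include" is existential versus universal quantification. XPath 1.0 has no "for all" construct, so a common approach is a double negation: there is no Remark that fails to contain "great". The sketch below illustrates this on hypothetical data with several Remarks per book (Authors omitted for brevity); the class name XPathQuantifiers and the eval helper are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathQuantifiers {
    // Hypothetical data: A has two "great" remarks, B has one of two, C has none.
    static final String XML =
        "<Bookstore>"
      + "<Book ISBN='A'><Title>T1</Title>"
      + "<Remark>great intro</Remark><Remark>a great deal</Remark></Book>"
      + "<Book ISBN='B'><Title>T2</Title>"
      + "<Remark>too long</Remark><Remark>great examples</Remark></Book>"
      + "<Book ISBN='C'><Title>T3</Title></Book>"
      + "</Bookstore>";

    // Evaluates expr against XML and returns the string value of each matched node.
    static List<String> eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // some Remark includes "great": matches A and B
        System.out.println(eval("//Book[Remark[contains(., 'great')]]/@ISBN"));
        // pitfall: contains(Remark, ...) only looks at the FIRST Remark: matches A only
        System.out.println(eval("//Book[contains(Remark, 'great')]/@ISBN"));
        // all Remarks include "great" (vacuously true with no Remarks): matches A and C
        System.out.println(eval("//Book[not(Remark[not(contains(., 'great'))])]/@ISBN"));
        // all Remarks include "great" and there is at least one: matches A only
        System.out.println(eval(
            "//Book[Remark and not(Remark[not(contains(., 'great'))])]/@ISBN"));
    }
}
```

Whether a book with no Remarks should count as "all Remarks include great" is a matter of interpretation; the last two queries show both readings.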

SAX and DOM

APIs for handling parsed XML.
DOM (Document Object Model) gives you the XML tree as a big (object oriented) data structure in memory.
Useful for doing major surgery on an XML document. (But XQuery often turns out to be easier.)
Useful if you need the tree explicitly, for example to convert it to a GUI tree widget.

(Example: count all words in an XML document)


...
Document d = parser.getDocument();
int numWords = countWordsInNode(d);
...


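  // Recursively counts the words in all text nodes at or below the given node.
  // Relies on a helper countWordsInString(String) defined elsewhere in the class.
  public static int countWordsInNode(Node node) {

    int numWords = 0;

    // Recurse into every child node first (elements, text, comments, ...).
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        numWords += countWordsInNode(children.item(i));
      }
    }

    // Only text nodes contribute words directly. Attribute values are not
    // child nodes in DOM, so they are never visited and never counted.
    int type = node.getNodeType();
    if (type == Node.TEXT_NODE) {
      String s = node.getNodeValue();
      numWords += countWordsInString(s);
    }

    return numWords;

  }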
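
To try the word-count traversal end to end, the pieces left out above (obtaining the Document and the countWordsInString helper) need to be filled in. The sketch below is one way to do that and rests on assumptions: it uses the standard JAXP DocumentBuilder rather than the parser.getDocument() call shown in the fragment, and it guesses that countWordsInString counts whitespace-separated tokens. The class name DomWordCount and the countWords wrapper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomWordCount {
    // Assumed behavior: a "word" is a maximal run of non-whitespace characters.
    static int countWordsInString(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Same traversal as in the notes: sum over children, plus this node if it is text.
    static int countWordsInNode(Node node) {
        int numWords = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            numWords += countWordsInNode(children.item(i));
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            numWords += countWordsInString(node.getNodeValue());
        }
        return numWords;
    }

    // Parses the XML text into a DOM tree in memory and counts words from the root.
    static int countWords(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document d = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return countWordsInNode(d);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Book ISBN='ISBN-0-13-035300-0' Price='$65'>"
                   + "<Title>A First Course in Database Systems</Title></Book>";
        System.out.println(countWords(xml)); // counts only the Title text, not attributes
    }
}
```

The whole document is held in memory as a tree before any counting starts, which is the main cost of the DOM approach on large inputs.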

SAX (Simple API for XML) is indeed simpler.
With SAX, you imagine a scan of the XML tree which fires off lots of events. You can write handlers for these events, handling whatever you want to. You don't deal with the actual DOM tree.

(Pseudocode Example: get all ISBNs)
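
As a concrete counterpart to the pseudocode, the event-handler idea can be sketched in Java with the standard JAXP SAX parser: a handler reacts to each start-of-element event and records the ISBN attribute when the element is a Book. This is a sketch under assumptions, not the lecture's own code; the class name SaxIsbns and the isbns method are hypothetical, and the input is passed as a string without a DOCTYPE so no DTD file is needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxIsbns {
    // Scans the XML once and collects the ISBN attribute of every Book element.
    static List<String> isbns(String xml) {
        List<String> found = new ArrayList<>();

        // DefaultHandler ignores every event; we override only the one we care about.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("Book".equals(qName)) {
                    String isbn = atts.getValue("ISBN");
                    if (isbn != null) {
                        found.add(isbn);
                    }
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<Bookstore>"
                   + "<Book ISBN='ISBN-0-13-035300-0' Price='$65' Edition='2nd'>"
                   + "<Title>A First Course in Database Systems</Title></Book>"
                   + "<Book ISBN='ISBN-0-13-031995-3' Price='$75'>"
                   + "<Title>Database Systems: The Complete Book</Title></Book>"
                   + "</Bookstore>";
        System.out.println(isbns(xml));
    }
}
```

No tree is ever built here: the handler sees each element as the parser passes over it, keeps only what it needs, and the rest of the document is discarded.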
