CS145 Lecture Notes -- Data: Relations and XML

Terminology reminder

Data model: general conceptual way of structuring data
Schema: structure of a particular database under a certain data model
Instance: actual data conforming to a schema

Relational Model versus XML: Fundamental Differences

Relations: Schema must be fixed in advance
XML: Does not require predefined, fixed schema
Relations: Rigid flat table structure
XML: Flexible hierarchical structure (also arbitrary graphs)
Relations: Easy to understand, simple query language
XML: Can be harder to understand, more complex query language (if any)

Example Applications
1. University records: students, courses, grades, etc.
  Relations or XML? Why?
2. University Web site: news, academics, admissions, events, research, etc.
  Relations or XML? Why?
3. Family tree.
  Relations or XML? Why?
Application domain we use for example schemas and instances:
- Textbook: Movies
- Lectures: Booksellers, UC admissions
- Project: Online auction
The Relational Model
- Used by most commercial Database Managements Systems (DBMSs)
- Very simple model
- Enables simple, clean, declarative query languages
Database = set of relations (or tables), each with a distinct name
(Example: Student, Campus)
Each relation has a set of attributes (or columns), with a distinct name within its relation.
(Example: label Student, Campus)
Each tuple (or row) in a relation has a value for each attribute.
(Example: some tuples for Student, Campus relations)
Each attribute has a type (or domain).
```
  ID:char(9), name:char(25), GPA:float, age:integer
```
We'll use atomic (indivisible) types only for now.

schema = complete description of structure of relations in database: relation names, attribute names, types, etc.
instance = actual contents (tuples) of relations
Specifying the schema (on paper):
- Without types:
  Student(ID, name, GPA, age)
- With types:
  Student(ID: char(9), name: char(25), GPA: float, age: integer)
Naming conflicts:
```
    Student(ID, name, GPA, age)
    Campus(name, enrollment)      <- duplicate use of "name" okay
    Apply(ID, Campus)             <- "ID" okay, "Campus" not okay
```
Null values:
- Can use special null value in attribute of any type
- Can be problematic (e.g., for comparisons)
(Unknown age example)
Keys: A key for a relation is a set of attributes such that no two tuples can have the same values for all of their key attributes.
- Key values identify specific tuples.
- System may build special indexes over key values.
- Other tuples may use key values as logical "pointers".
- Specify keys by underlining (so no good way to specify multiple keys on paper)
(Example: keys for Student, Campus)
Relations in SQL
- Declared as above with "create table", see book and Oracle help documents for details
- SQL includes ways to add/drop attributes and specify automatic default values for attributes.
XML
If you're interested in reading about the XML standard in detail, visit:
```
   http://www.w3.org/XML/
```
It will keep you busy for days. A very short document linked from that site and the course schedule ("XML in 10 points") is required reading.

XML = Extensible Markup Language
- A (relatively) new standard for data representation and exchange, primarily intended for the internet
- A document format: a superset of HTML, a subset of SGML (roughly)
- Some say that XML is to data what Java is to programming.
- Take note:
  - XML can be very clunky.
  - The official specification is enormous, but the basic idea is very simple. In this class we care about the basic idea.
Like SGML and HTML, basic XML consists of three things:
1. Tagged elements, which may be nested within one another
2. Attributes on elements
3. Text
In HTML, tags denote formatting: <Title>, <I>, <Table>, etc.
In XML, tags denote meaning of data: <Student>, <Book_Title>, etc.
One way to format XML data is to use XSL - the Extensible Stylesheet Language - to translate XML to HTML. We'll probably cover XSL later in the course.
Well-formed XML
A well-formed XML document is any XML document that follows the basic rules: single root element, matched tags, unique attribute names, etc.
Example: bookstore data
```
   <?xml version="1.0" standalone="yes"?>
   <Bookstore>
      <Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd">
         <Title>A First Course in Database Systems</Title>
         <Authors>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
      </Book>
      <Book ISBN="ISBN-0-13-031995-3" Price="$75">
         <Title>Database Systems: The Complete Book</Title>
         <Authors>
            <Author>
               <First_Name>Hector</First_Name>
               <Last_Name>Garcia-Molina</Last_Name>
            </Author>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
         <Remark>
         Amazon.com says: Buy this book bundled with "A First Course,"
         it's a great deal!
         </Remark>
      </Book>
   </Bookstore>
```
A well-formed XML document can contain regular data (as above) or very irregular data.

Valid XML
It's possible to define a kind of schema for XML data, called a Document Type Descriptor (DTD).
A DTD is a grammar that describes the legal attributes of tagged elements and the legal ordering and nesting of the elements.
Example DTD
```
   <!ELEMENT Bookstore (Book | Magazine)*>
   <!ELEMENT Book (Title, Authors, Remark?)>
   <!ATTLIST Book ISBN CDATA #REQUIRED
                Price CDATA #REQUIRED
                Edition CDATA #IMPLIED>
   <!ELEMENT Magazine (Title)>
   <!ATTLIST Magazine Month CDATA #REQUIRED Year CDATA #REQUIRED> 
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Authors (Author+)>
   <!ELEMENT Remark (#PCDATA)>
   <!ELEMENT Author (First_Name, Last_Name)>
   <!ELEMENT First_Name (#PCDATA)>
   <!ELEMENT Last_Name (#PCDATA)>
```
The DTD is specified at the top of the document or in a separate file referenced at the top of the document. In both cases use standalone="no".
Question: What are the benefits of using a DTD?
Question: What are the benefits of not using a DTD?
ID and IDREF(S) Attributes
Element pointers: assign a special ID attribute to an element, then point to that element with a special IDREF or IDREFS attribute in another element.
Example: reorganized bookstore
```
   <?xml version="1.0" standalone="no"?>
   <!DOCTYPE Bookstore SYSTEM "bookstore.dtd">
   <Bookstore>
      <Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd" Authors="JU JW">
         <Title>A First Course in Database Systems</Title>
      </Book>
      <Book ISBN="ISBN-0-13-031995-3" Price="$75" Authors="HG JU JW">
         <Title>Database Systems: The Complete Book</Title>
         <Remark>
            Amazon.com says: Buy this book bundled with
            <BookRef book="ISBN-0-13-035300-0"></BookRef>,
            It's a great deal!
         </Remark>
      </Book>
      <Author Ident="HG">
         <First_Name>Hector</First_Name>
         <Last_Name>Garcia-Molina</Last_Name>
      </Author>
      <Author Ident="JU">
         <First_Name>Jeffrey</First_Name>
         <Last_Name>Ullman</Last_Name>
      </Author>
      <Author Ident="JW">
         <First_Name>Jennifer</First_Name>
         <Last_Name>Widom</Last_Name>
      </Author>
   </Bookstore>
```
DTD for this data:
```
   <!ELEMENT Bookstore (Book*, Author*)>
   <!ELEMENT Book (Title, Remark?)>
   <!ATTLIST Book ISBN ID #REQUIRED
                Price CDATA #REQUIRED
                Edition CDATA #IMPLIED
                Authors IDREFS #REQUIRED>
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Remark (#PCDATA | BookRef)*>
   <!ELEMENT BookRef EMPTY>
   <!ATTLIST BookRef book IDREF #REQUIRED>
   <!ELEMENT Author (First_Name, Last_Name)>
   <!ATTLIST Author Ident ID #REQUIRED>
   <!ELEMENT First_Name (#PCDATA)>
   <!ELEMENT Last_Name (#PCDATA)>
```
Digression on (#P)CDATA
The details are messy (recall comments about clunky XML and official specs above), but overall:
- #PCDATA is only for element content, CDATA is only for attribute types, you cannot use them interchangeably, and there are no keywords PCDATA or #CDATA. Both CDATA and #PCDATA essentially specify text where the special characters "<", ">", and "&" have to be escaped as "<", ">" and "&" respectively.
- Recommendation: Use CDATA for string-valued attributes, use #PCDATA for elements containing text. If you want an element to contain a mixture of text and other elements, do so by specifying the element types along with #PCDATA in a 0-or-more list, e.g., (#PCDATA | Author | Editor)*.
XML Coverage
- Not covering: Namespaces, XLink, XPointer
- Covering later: XPath, XQuery
- May or may not cover: XSLT, XML Schema
Relational Model versus XML
(First three are same as earlier)
1. Relations: Schema must be fixed in advance
  XML: Does not require predefined, fixed schema
2. Relations: Rigid flat table structure
  XML: Flexible hierarchical structure; graphs
3. Relations: Easy to understand, simple query language
  XML: Can be harder to understand, more complex query language (if any)
4. Relations: Ordering of data not relevant (tuple ordering or attribute ordering)
  XML: Ordering forced by document format, may or may not be relevant
5. Relations: Transmission and sharing can be problematic
  XML: Designed for easy representation and exchange
6. Relations: "Native" data model for all current serious commercial DBMSs
  XML: "Add-on," often implemented on top of relations

CS145 Lecture Notes -- Data: Relations and XML

Relational Model versus XML: Fundamental Differences

Example Applications

The Relational Model

Relations in SQL

XML

Well-formed XML

Valid XML

Example DTD

ID and IDREF(S) Attributes

Example: reorganized bookstore

Digression on (#P)CDATA

XML Coverage

Relational Model versus XML