CS145 Lecture Notes (1) -- Introductory Material, Data Models

This lecture material complements the assigned readings:

Chapter 1 of the textbook, and the Introduction to "SQL for Web Nerds" by Philip Greenspun (linked from the Course Schedule).
For Relations: textbook section 3.1
For XML: textbook section 4.7, online XML and DTD tutorials linked from the schedule

Database Management System (DBMS):
Provides efficient, convenient, and safe multi-user storage of and access to massive amounts of persistent data

Most familiar use: many Web sites rely heavily on DBMS's. (And you will build one!)

Examples: (solicit from class)

Non-Web examples of DBMS's: (solicit from class)

Example: Online Bookseller

Data = information on books (including categories, bestsellers, etc.), customers, pending orders, order histories, trends and preferences, etc.

Massive: many gigabytes at a minimum for medium-size bookseller, more if keep all order histories over all time, even more if keep clickstream logs, even more if keep images of book covers and sample pages
=> Far too big for memory
Persistent: data outlives programs that operate on it
Multi-user: many people/programs accessing same database, or even same data, simultaneously
=> Need careful controls

Multi-user example and discussion:

Jane and John both have ID number for bookseller gift certificate (credit) of $200 they got as a wedding gift.

Jane @ her office: orders "Database Systems: The Complete Book" ($75)

  prompt user for credit ID;
  get credit from database;
  if credit >= 75 then
      credit := credit - 75;
      issue order to mail book;
      put new credit into database;
  else print "sorry"

John @ his office: orders "A First Course in Database Systems" ($65)

  prompt user for credit ID;
  get credit from database;
  if credit >= 65 then
      credit := credit - 65;
      issue order to mail book;
      put new credit into database;
  else print "sorry"

Initial credit = $200
Ending credit = ???

Appears similar to concurrent programming problems (synchronization, semaphores, etc.)
BUT: data not main-memory variables

Appears similar to file system concurrent access
BUT: want to control at smaller granularity

Also database may be distributed, replicated, etc.

Safe: (1) From system failures (2) From malicious users (3) From inconsistent state
Convenient: Simple commands to - find a specific book, list all books in a certain category and price range, generate an order history, produce sales figures grouped by state, etc.
=> Also unpredicted queries should be easy
Efficient: Don't search all files in order to - get price of one book, get all customers from northern California, get bestselling books from last week
Massive data => DBMS's carefully tuned for performance

A DBMS is a Software System

Buy, install, set up for particular application
Available for PC's, workstations, mainframes, supercomputers
Frequently used in conjunction with application server or other middleware
Basic software infrastructure: (figure)

Major DBMS Products and Freeware

(Fewer vendors every year...)

Oracle
IBM: DB2, Informix
Microsoft: SQL Server, Access
Sybase
MySQL
Postgres

All are "relational" (or "object-relational") database systems at their core.

All commercial vendors also provide XML support, either built-in or built-on-top.

People

DBMS implementor: builds system (245, 346)
Database designer: establishes schema (145)
Database administrator: loads data, keeps whole thing running (145)
Database user: queries/modifies data (145)

Less Traditional Database Applications

Stock monitoring, air traffic, telecom call records: real-time data streams, historical data and queries, "active" database
Distributed, heterogeneous databases: collaborative design, medical information systems, data warehousing, "querying the Web"
Scientific data - e.g., satellite, sensors, X-ray: terabytes or much more, fancy interfaces, specialized query languages

Data Models

Terminology

Data model: general conceptual way of structuring data
Schema: structure of a particular database under a certain data model
Instance: actual data conforming to a schema

Relational Model versus XML: Fundamental Differences

Relations: Schema must be fixed in advance
XML: Does not require predefined, fixed schema
Relations: Rigid flat table structure
XML: Flexible hierarchical structure (also arbitrary graphs)
Relations: Easy to understand, simple query language
XML: Can be harder to understand, more complex query language (if any)

Example Applications

University records: students, courses, grades, etc.
Relations or XML? Why?
University Web site: news, academics, admissions, events, research, etc.
Relations or XML? Why?
Family tree.
Relations or XML? Why?

Application domain we use for example schemas and instances:

Textbook: Movies
Lectures: Booksellers, UC admissions, etc.
Project: Online auction

The Relational Model

Used by most commercial Database Managements Systems (DBMSs)
Very simple model
Enables simple, clean, declarative query languages

Database = set of relations (or tables), each with a distinct name

(Example: Student, Campus)

Each relation has a set of attributes (or columns), with a distinct name within its relation.

(Example: label Student, Campus)

Each tuple (or row) in a relation has a value for each attribute.

(Example: some tuples for Student, Campus relations)

Each attribute has a type (or domain).

    ID:char(9), name:char(25), GPA:float, age:integer

We'll use atomic (indivisible) types only for now.

schema = complete description of structure of relations in database: relation names, attribute names, types, etc.
instance = actual contents (tuples) of relations

Specifying the schema (on paper):

Without types:
Student(ID, name, GPA, age)
With types:
Student(ID: char(9), name: char(25), GPA: float, age: integer)

Naming conflicts:

    Student(ID, name, GPA, age)
    Campus(name, enrollment)      <- duplicate use of "name" okay
    Apply(ID, Campus)             <- "ID" okay, "Campus" not okay

Null values:

Can use special null value in attribute of any type
Can be problematic (e.g., for comparisons)

(Unknown age example)

Keys: A key for a relation is a set of attributes such that no two tuples can have the same values for all of their key attributes.

Key values identify specific tuples.
System may build special indexes over key values.
Other tuples may use key values as logical "pointers".
Specify keys by underlining (so no good way to specify multiple keys on paper)

(Example: keys for Student, Campus)

XML

If you're interested in reading about the XML standard in detail, visit:

http://www.w3.org/XML/

It will keep you busy for days. A very short document linked from that site and the course schedule ("XML in 10 points") is required reading.

XML = Extensible Markup Language

A relatively new standard for data representation and exchange, intended initially for the internet
A document format: a superset of HTML, a subset of SGML (roughly)
Some say that XML is to data what Java is to programming.
Take note:
- XML can be clunky.
- The official specification is enormous, but the basic idea is very simple. In this class we care about the basic idea.

Like SGML and HTML, basic XML consists of three things:

Tagged elements, which may be nested within one another
Attributes on elements
Text

In HTML, tags denote formatting: <Title>, <I>, <Table>, etc.

In XML, tags denote meaning of data: <Student>, <Book_Title>, etc.

XML data can be formatted using CSS (Cascading Style Sheets) or XSL (Extensible Stylesheet Language) to translate XML to HTML.

Well-formed XML

A well-formed XML document is any XML document that follows the basic rules: single root element, matched tags, unique attribute names, etc.

Example: bookstore data

   <?xml version="1.0" standalone="yes"?>
   <Bookstore>
      <Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd">
         <Title>A First Course in Database Systems</Title>
         <Authors>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
      </Book>
      <Book ISBN="ISBN-0-13-031995-3" Price="$75">
         <Title>Database Systems: The Complete Book</Title>
         <Authors>
            <Author>
               <First_Name>Hector</First_Name>
               <Last_Name>Garcia-Molina</Last_Name>
            </Author>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
         <Remark>
         Amazon.com says: Buy this book bundled with "A First Course,"
         it's a great deal!
         </Remark>
      </Book>
   </Bookstore>

A well-formed XML document can contain regular data (as above) or very irregular data.

Valid XML

It's possible to define a kind of schema for XML data, called a Document Type Descriptor (DTD).

A DTD is a grammar that describes the legal attributes of tagged elements and the legal ordering and nesting of the elements.

Example DTD

   <!ELEMENT Bookstore (Book | Magazine)*>
   <!ELEMENT Book (Title, Authors, Remark?)>
   <!ATTLIST Book ISBN CDATA #REQUIRED
                Price CDATA #REQUIRED
                Edition CDATA #IMPLIED>
   <!ELEMENT Magazine (Title)>
   <!ATTLIST Magazine Month CDATA #REQUIRED Year CDATA #REQUIRED> 
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Authors (Author+)>
   <!ELEMENT Remark (#PCDATA)>
   <!ELEMENT Author (First_Name, Last_Name)>
   <!ELEMENT First_Name (#PCDATA)>
   <!ELEMENT Last_Name (#PCDATA)>

The DTD is specified at the top of the document or in a separate file referenced at the top of the document. In both cases use standalone="no".

Question: What are the benefits of using a DTD?

Question: What are the benefits of not using a DTD?

ID and IDREF(S) Attributes

Element pointers: assign a special ID attribute to an element, then point to that element with a special IDREF or IDREFS attribute in another element.

Example: reorganized bookstore

   <?xml version="1.0" standalone="no"?>
   <!DOCTYPE Bookstore SYSTEM "bookstore.dtd">
   <Bookstore>
      <Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd" Authors="JU JW">
         <Title>A First Course in Database Systems</Title>
      </Book>
      <Book ISBN="ISBN-0-13-031995-3" Price="$75" Authors="HG JU JW">
         <Title>Database Systems: The Complete Book</Title>
         <Remark>
            Amazon.com says: Buy this book bundled with
            <BookRef book="ISBN-0-13-035300-0" />,
            It's a great deal!
         </Remark>
      </Book>
      <Author Ident="HG">
         <First_Name>Hector</First_Name>
         <Last_Name>Garcia-Molina</Last_Name>
      </Author>
      <Author Ident="JU">
         <First_Name>Jeffrey</First_Name>
         <Last_Name>Ullman</Last_Name>
      </Author>
      <Author Ident="JW">
         <First_Name>Jennifer</First_Name>
         <Last_Name>Widom</Last_Name>
      </Author>
   </Bookstore>

DTD for this data:

   <!ELEMENT Bookstore (Book*, Author*)>
   <!ELEMENT Book (Title, Remark?)>
   <!ATTLIST Book ISBN ID #REQUIRED
                Price CDATA #REQUIRED
                Edition CDATA #IMPLIED
                Authors IDREFS #REQUIRED>
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Remark (#PCDATA | BookRef)*>
   <!ELEMENT BookRef EMPTY>
   <!ATTLIST BookRef book IDREF #REQUIRED>
   <!ELEMENT Author (First_Name, Last_Name)>
   <!ATTLIST Author Ident ID #REQUIRED>
   <!ELEMENT First_Name (#PCDATA)>
   <!ELEMENT Last_Name (#PCDATA)>

Digression on (#P)CDATA

The details are messy (recall comments about clunky XML and official specs above), but overall:

#PCDATA is only for element content, CDATA is only for attribute types, you cannot use them interchangeably, and there are no keywords PCDATA or #CDATA. Both CDATA and #PCDATA essentially specify text where the special characters "<", ">", and "&" have to be escaped as "<", ">" and "&" respectively.
Recommendation: Use CDATA for string-valued attributes, use #PCDATA for elements containing text. If you want an element to contain a mixture of text and other elements, do so by specifying the element types along with #PCDATA in a 0-or-more list, e.g., (#PCDATA | Author | Editor)*.

XML Coverage

Not covering: Namespaces, XLink, XPointer
Covering later: XPath, XQuery
Probably will not cover: CSS, XSLT, XML Schema

Relational Model versus XML

(First three are same as earlier)

Relations: Schema must be fixed in advance
XML: Does not require predefined, fixed schema
Relations: Rigid flat table structure
XML: Flexible hierarchical structure; graphs
Relations: Easy to understand, simple query language
XML: Can be harder to understand, more complex query language
Relations: Ordering of data not relevant (tuple ordering or attribute ordering)
XML: Ordering forced by document format, may or may not be relevant
Relations: Transmission and sharing can be problematic
XML: Designed for easy representation and exchange
Relations: "Native" data model for all current serious commercial DBMSs
XML: "Add-on," often implemented on top of relations