CS145
Lecture Notes (1)
-- Introductory Material, Data Models
This lecture material complements the assigned
readings:
- Chapter 1
of the textbook, and the Introduction to "SQL for Web Nerds" by Philip
Greenspun (linked from the Course Schedule).
- For Relations: textbook section
3.1
- For
XML: textbook section 4.7, online XML and DTD tutorials linked from the
schedule
Database Management System (DBMS):
Provides
efficient, convenient, and safe
multi-user
storage of and access to massive amounts of persistent
data
Most familiar use: many Web sites rely heavily on
DBMS's. (And you
will build one!)
Examples: (solicit from class)
Non-Web examples of DBMS's: (solicit from class)
Example: Online Bookseller
Data = information on books (including categories, bestsellers, etc.),
customers, pending orders, order histories, trends and preferences,
etc.
- Massive: many gigabytes at a
minimum for medium-size
bookseller, more if keep all order histories over all time, even more
if keep clickstream logs, even more if keep images of book covers and
sample pages
=> Far too big for memory
- Persistent:
data outlives programs that operate on it
- Multi-user:
many people/programs accessing same database, or
even same data, simultaneously
=> Need
careful controls
Multi-user
example and discussion:
Jane and John both have ID number for bookseller gift
certificate
(credit) of $200 they got as a wedding gift.
Jane @ her office: orders "Database Systems: The
Complete Book" ($75)
prompt user for credit ID;
get credit from database;
if credit >= 75 then
credit := credit - 75;
issue order to mail book;
put new credit into database;
else print "sorry"
John @ his office: orders "A First Course in Database Systems" ($65)
prompt user for credit ID;
get credit from database;
if credit >= 65 then
credit := credit - 65;
issue order to mail book;
put new credit into database;
else print "sorry"
Initial credit = $200
Ending credit = ???
Appears similar to concurrent programming
problems (synchronization,
semaphores, etc.)
BUT: data not main-memory variables
Appears similar to file system
concurrent access
BUT: want to control at smaller granularity
Also database may be
distributed, replicated, etc.
- Safe:
(1) From system failures (2) From
malicious users (3) From inconsistent state
- Convenient:
Simple commands to - find a specific book,
list all books in a certain category and price range, generate an
order history, produce sales figures grouped by state, etc.
=> Also unpredicted queries should be easy
- Efficient:
Don't search all files in order to - get
price of one book, get all customers from northern California, get
bestselling books from last week
Massive data => DBMS's carefully tuned for
performance
A DBMS is a Software System
- Buy, install, set up for particular application
- Available for PC's,
workstations, mainframes, supercomputers
- Frequently used in conjunction
with application
server or other middleware
- Basic software infrastructure:
(figure)
Major DBMS
Products and Freeware
(Fewer vendors every year...)
- Oracle
- IBM: DB2, Informix
- Microsoft: SQL Server, Access
- Sybase
- MySQL
- Postgres
All are "relational" (or
"object-relational") database systems at their core.
All commercial vendors also
provide XML support, either
built-in or built-on-top.
People
- DBMS implementor: builds system
(245, 346)
- Database designer:
establishes schema (145)
- Database
administrator: loads data, keeps whole thing running (145)
- Database user:
queries/modifies data (145)
Less Traditional Database
Applications
- Stock monitoring, air traffic, telecom call
records: real-time
data streams, historical data and queries, "active" database
- Distributed, heterogeneous
databases: collaborative design,
medical information systems, data warehousing, "querying the Web"
- Scientific data - e.g.,
satellite, sensors, X-ray: terabytes
or much more, fancy interfaces, specialized query languages
Data Models
Terminology
- Data model: general
conceptual way of structuring data
- Schema: structure of a
particular database under a certain data model
- Instance: actual data
conforming to a schema
Relational Model
versus XML: Fundamental Differences
- Relations: Schema must be fixed
in advance
XML: Does not require predefined, fixed schema
- Relations: Rigid flat table
structure
XML: Flexible hierarchical structure (also arbitrary graphs)
- Relations: Easy to understand,
simple query language
XML: Can be harder to understand, more complex query
language (if any)
Example Applications
- University records: students, courses, grades,
etc.
Relations or XML? Why?
- University Web site: news,
academics, admissions, events, research, etc.
Relations or XML? Why?
- Family tree.
Relations or XML? Why?
Application domain we use for example schemas and instances:
- Textbook: Movies
- Lectures: Booksellers, UC
admissions, etc.
- Project: Online auction
The Relational Model
- Used by most commercial Database Managements
Systems (DBMSs)
- Very simple model
- Enables simple, clean, declarative
query languages
Database = set of relations
(or tables), each with a distinct name
(Example: Student, Campus)
Each relation has a set of attributes
(or columns), with
a distinct name within its relation.
(Example: label Student, Campus)
Each tuple (or row)
in a relation has a value for each attribute.
(Example: some tuples for Student, Campus relations)
Each attribute has a type
(or domain).
ID:char(9), name:char(25), GPA:float, age:integer
We'll use atomic (indivisible) types only for now.
schema = complete
description of structure of relations in
database: relation names, attribute names, types, etc.
instance = actual contents (tuples) of
relations
Specifying the schema (on paper):
- Without types:
Student(ID,
name, GPA, age)
- With types:
Student(ID:
char(9), name: char(25), GPA: float, age: integer)
Naming conflicts:
Student(ID, name, GPA, age)
Campus(name, enrollment) <- duplicate use of "name" okay
Apply(ID, Campus) <- "ID" okay, "Campus" not okay
Null values:
- Can use special null value in
attribute of any type
- Can be problematic (e.g., for comparisons)
(Unknown age example)
Keys: A key for a relation is
a set of attributes such that no two tuples
can have the same values for all of their key attributes.
- Key values identify specific tuples.
- System may build special indexes over key values.
- Other tuples may use key values as logical
"pointers".
- Specify keys by underlining (so no good way to
specify multiple keys on paper)
(Example: keys for Student, Campus)
XML
If you're interested in reading about the XML standard in
detail,
visit:
http://www.w3.org/XML/
It will keep you busy for days. A very short
document linked from
that site and the course schedule ("XML in 10 points") is required
reading.
XML = Extensible Markup Language
- A relatively new standard for data representation
and exchange,
intended initially for the internet
- A document format: a superset of HTML, a subset
of SGML (roughly)
- Some say that XML is to data what Java is to
programming.
- Take note:
- XML can be clunky.
- The official specification is enormous, but the
basic idea is very simple. In this class we care about the basic idea.
Like SGML and HTML, basic XML consists of three things:
- Tagged elements, which may be
nested within one another
- Attributes on elements
- Text
In HTML, tags denote formatting: <Title>,
<I>, <Table>
, etc.
In XML, tags denote meaning of data: <Student>,
<Book_Title>
, etc.
XML data can be formatted using CSS (Cascading
Style Sheets)
or XSL (Extensible Stylesheet Language) to translate
XML to
HTML.
Well-formed XML
A well-formed XML document is any XML document that follows the basic
rules: single root element, matched tags, unique attribute names, etc.
Example:
bookstore data
<?xml version="1.0" standalone="yes"?>
<Bookstore>
<Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd">
<Title>A First Course in Database Systems</Title>
<Authors>
<Author>
<First_Name>Jeffrey</First_Name>
<Last_Name>Ullman</Last_Name>
</Author>
<Author>
<First_Name>Jennifer</First_Name>
<Last_Name>Widom</Last_Name>
</Author>
</Authors>
</Book>
<Book ISBN="ISBN-0-13-031995-3" Price="$75">
<Title>Database Systems: The Complete Book</Title>
<Authors>
<Author>
<First_Name>Hector</First_Name>
<Last_Name>Garcia-Molina</Last_Name>
</Author>
<Author>
<First_Name>Jeffrey</First_Name>
<Last_Name>Ullman</Last_Name>
</Author>
<Author>
<First_Name>Jennifer</First_Name>
<Last_Name>Widom</Last_Name>
</Author>
</Authors>
<Remark>
Amazon.com says: Buy this book bundled with "A First Course,"
it's a great deal!
</Remark>
</Book>
</Bookstore>
A well-formed XML document can contain regular data (as above) or very
irregular data.
Valid XML
It's possible to define a kind of schema for XML data, called a
Document Type Descriptor (DTD).
A DTD is a grammar that describes the legal attributes of
tagged
elements and the legal ordering and nesting of the elements.
Example DTD
<!ELEMENT Bookstore (Book | Magazine)*>
<!ELEMENT Book (Title, Authors, Remark?)>
<!ATTLIST Book ISBN CDATA #REQUIRED
Price CDATA #REQUIRED
Edition CDATA #IMPLIED>
<!ELEMENT Magazine (Title)>
<!ATTLIST Magazine Month CDATA #REQUIRED Year CDATA #REQUIRED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Authors (Author+)>
<!ELEMENT Remark (#PCDATA)>
<!ELEMENT Author (First_Name, Last_Name)>
<!ELEMENT First_Name (#PCDATA)>
<!ELEMENT Last_Name (#PCDATA)>
The DTD is specified at the top of the document or in a separate file
referenced at the top of the document. In both cases use
standalone="no"
.
Question: What are the benefits of using a DTD?
Question: What are the
benefits of not using a DTD?
ID and IDREF(S) Attributes
Element pointers: assign a special ID
attribute to an
element, then point to that element with a special IDREF
or
IDREFS
attribute in another element.
Example: reorganized bookstore
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Bookstore SYSTEM "bookstore.dtd">
<Bookstore>
<Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd" Authors="JU JW">
<Title>A First Course in Database Systems</Title>
</Book>
<Book ISBN="ISBN-0-13-031995-3" Price="$75" Authors="HG JU JW">
<Title>Database Systems: The Complete Book</Title>
<Remark>
Amazon.com says: Buy this book bundled with
<BookRef book="ISBN-0-13-035300-0" />,
It's a great deal!
</Remark>
</Book>
<Author Ident="HG">
<First_Name>Hector</First_Name>
<Last_Name>Garcia-Molina</Last_Name>
</Author>
<Author Ident="JU">
<First_Name>Jeffrey</First_Name>
<Last_Name>Ullman</Last_Name>
</Author>
<Author Ident="JW">
<First_Name>Jennifer</First_Name>
<Last_Name>Widom</Last_Name>
</Author>
</Bookstore>
DTD for this data:
<!ELEMENT Bookstore (Book*, Author*)>
<!ELEMENT Book (Title, Remark?)>
<!ATTLIST Book ISBN ID #REQUIRED
Price CDATA #REQUIRED
Edition CDATA #IMPLIED
Authors IDREFS #REQUIRED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Remark (#PCDATA | BookRef)*>
<!ELEMENT BookRef EMPTY>
<!ATTLIST BookRef book IDREF #REQUIRED>
<!ELEMENT Author (First_Name, Last_Name)>
<!ATTLIST Author Ident ID #REQUIRED>
<!ELEMENT First_Name (#PCDATA)>
<!ELEMENT Last_Name (#PCDATA)>
Digression on (#P)CDATA
The details are messy (recall comments about clunky XML and
official specs above), but overall:
-
#PCDATA
is only for
element content, CDATA
is
only for attribute types, you cannot use them interchangeably, and
there are no keywords PCDATA
or #CDATA
.
Both
CDATA
and #PCDATA
essentially specify text where the
special characters "<", ">", and "&" have to be
escaped as
"<", ">" and "&" respectively.
- Recommendation: Use
CDATA
for string-valued
attributes, use #PCDATA
for elements
containing text. If you
want an element to contain a mixture of text and other elements, do so
by specifying the element types along with #PCDATA
in a
0-or-more list, e.g., (#PCDATA
| Author |
Editor)*.
XML Coverage
- Not covering: Namespaces, XLink,
XPointer
- Covering later: XPath,
XQuery
- Probably will not cover: CSS,
XSLT, XML Schema
Relational Model versus XML
(First three are same as earlier)
- Relations: Schema must be fixed
in advance
XML: Does not require predefined, fixed schema
- Relations: Rigid flat table
structure
XML: Flexible hierarchical structure; graphs
- Relations: Easy to understand,
simple query language
XML: Can be harder to understand, more complex query
language
- Relations: Ordering of data
not relevant (tuple ordering or attribute ordering)
XML: Ordering forced by document format, may or may not be
relevant
- Relations: Transmission and
sharing can be problematic
XML: Designed for easy representation and exchange
- Relations: "Native" data model
for all current serious commercial DBMSs
XML: "Add-on," often implemented on top of relations