CS145 Lecture Notes -- Data Warehousing and Data Mining



Two types of database activity:

OLTP: On-Line Transaction Processing

OLAP: On-Line Analytical Processing

Data Warehousing

Bring data from "operational" (OLTP) sources into a single warehouse to do analysis and mining (OLAP).

(system figure)

















Also referred to as Decision Support Systems (DSS)

=> Extremely popular in large corporations today. Many have spent millions in data warehousing projects.

Example: Wal-Mart

Example: Victoria's Secret

Technical challenges:
  1. Extracting data from operational sources in useful format
    (add to figure)

  2. Transforming, "cleaning" ("scrubbing"), and possibly summarizing operational data
    (add to figure)

  3. Integrating data from multiple sources
    (add to figure)

  4. Keeping warehouse up-to-date as source data changes

Data at Warehouse

Most warehouse applications are of similar character with two kinds of data:

Star Schema

One fact table referencing several dimension tables
Example:
  Sales(StoreID, ItemID, CustID, qty, price)  // fact table
  Store(StoreID, city, state)
  Item(ItemID, name, brand, color, size)
  Customer(CustID, name, address)
(diagram)










In fact table: Complete star join:
  SELECT *
  FROM   Sales, Store, Item, Customer
  WHERE  Sales.StoreID = Store.StoreID
  AND    Sales.ItemID = Item.ItemID
  AND    Sales.CustID = Customer.CustID
Typical OLAP query will:
  1. Do all or part of star join
  2. Filter interesting tuples based on fact and/or dimension data
  3. Group by one or more dimensions
  4. Aggregate the result
Example: Find the sum of all sales in California of blue items with item price > 100, grouped by store and customer















Performance:

Question: Why are materialized views appropriate in this setting?





Data Cubes

Also called "Mutidimensional OLAP"
Idea: N-dimensional version of spreadsheet

(diagram)












Fact table uniqueness:

Question: Is date a dimension or dependent attribute?



Let's keep things really simple: Sales(StoreID, ItemID, CustID, price)

Queries can "slice and dice", "drill down and roll up"

Performance:

Relational representation of cube

CUBE(Sales): All tuples of Sales + All tuples containing one or more "*" in place of dimension attributes

Example:

Example query: Find total sales of all blue items in California















Data Mining

Search for patterns in large databases Classic application: "market basket" data
  Purchase(salesID, item)
  ...
  (3, bread)
  (3, milk)
  (3, eggs)
  (3, beer)
  (4, beer)
  (4, chips)
  ....
Want to find association rules
  {L1,L2,...,Ln} -> R
"If a customer bought all the items in set {L1, L2, ..., Ln}, he is very likely to also have bought item R."

Example:

  {bread, milk} -> eggs
  {diapers}     -> beer

Question: Can we write a SQL query to find association rules?






Question: What can we write it in?



Two concepts for association rules: {L1,L2,...,Ln} -> R Goals of data mining:
Other types of data mining rules and patterns: