CS145 Lecture Notes (13) -- Data Warehousing and Data Mining

Two broad types of database activity:

OLTP: On-Line Transaction Processing

Short transactions, both queries and updates (e.g., update account balance, enroll in course)
Queries are simple (e.g., find account balance, find grade in course)
Queries and updates touch small portion of data (e.g., examples above)
Updates are frequent (e.g., movie tickets, seat reservations, shopping carts)

OLAP: On-Line Analytical Processing

Long transactions, usually complex queries (e.g., all statistics about all sales, grouped by dept and month)
Queries touch large portion of the data (e.g., as above)
"Data mining" operations
Infrequent updates

Data Warehousing

Bring data from "operational" (OLTP) sources into a single warehouse to do analysis and mining (OLAP).

(system figure)

Also referred to as Decision Support Systems (DSS)

=> Extremely popular in large corporations today. Many have spent millions in data warehousing projects.

Example: Wal-Mart

All distributor, store, and sales information copied into warehouse. Used to make store layout and supplier decisions. Hundreds of terabytes of data.

Example: Victoria's Secret

All sales information copied into warehouse. Used to supply appropriate merchandise (current and future) to appropriate outlets. Warehouse project paid off in 1-2 years.

Technical challenges:

Extracting data from operational sources in useful format
(add to figure)
Transforming, "cleaning" ("scrubbing"), and possibly summarizing operational data
(add to figure)
Integrating data from multiple sources
(add to figure)
Keeping warehouse up-to-date as source data changes

Data at Warehouse

Most warehouse applications are of similar character with two kinds of data:

Fact data: sales transactions, flight arrivals, course enrollments
Updated frequently, often append-only, very large
Dimension data: store items, store customers, students, courses
Updated infrequently, not as large

Star Schema

One fact table referencing several dimension tables
Example:

  Sales(StoreID, ItemID, CustID, qty, price)  // fact table
  Store(StoreID, city, state)
  Item(ItemID, name, brand, color, size)
  Customer(CustID, name, address)

(diagram)

In fact table:

Dimension attributes: foreign keys to dimension tables
Dependent attributes: all others, often aggregated in queries

Complete star join:

  SELECT *
  FROM   Sales, Store, Item, Customer
  WHERE  Sales.StoreID = Store.StoreID
  AND    Sales.ItemID = Item.ItemID
  AND    Sales.CustID = Customer.CustID

Typical OLAP query will:

Do all or part of star join
Filter interesting tuples based on fact and/or dimension data
Group by one or more dimensions
Aggregate the result

Example: Find the sum of all sales in California of blue items with item price > 100, grouped by store and customer

Performance:

OLAP queries can be extremely slow
New kinds of indexes
New query processing techniques
Systems make extensive use of materialized views

Question: Why are materialized views appropriate in this setting?

Data Cubes

Also called "Mutidimensional OLAP"
Idea:

Dimension data forms axes of "cube"
Fact data in cells
Aggregated data on sides, edges, and corners

N-dimensional version of spreadsheet

(diagram)

Fact table uniqueness:

If dimension attributes are not a key, cells must aggregate (e.g., sum of qty*price)
"Date" often used to create a key: Sales(StoreID,ItemID,CustID,date,qty,price)

Question: Is date a dimension or dependent attribute?

Let's keep things simple: Sales(StoreID, ItemID, CustID, price)

Queries can "slice and dice", "drill down and roll up"

Slice: Select on one or more dimensions
(e.g., only store S2, only stores in California)
Dice: Differing definitions
Constrain all dimensions (e.g., male customers buying clothes in California), or
Partition or group on one or more dimensions (e.g., stores by state and items by color)
Drill down ("de-aggregate"): Examining summary data, break it out by dimension attribute
Example: Looking at sum of all California sales, break it out by store
Roll up ("aggregate"): Examining data, summarize along some dimension
Example: Looking at data grouped by item and customer, aggregate so only grouped by customer

Performance:

Data cube can be huge
Also can be sparse
Can compute in advance, compute on demand, or some combination

SQL Constructs (`CUBE` and `ROLLUP`)

Textbook describes earlier proposed standard: CUBE(R) in FROM clause
These notes cover the actual standard: CUBE and ROLLUP in GROUP-BY clause

Adding "WITH CUBE" to a GROUP-BY query expands the query result into a full data cube:

  SELECT StoreID, ItemID, CustID, SUM(price)
  FROM Sales
  GROUP BY StoreID,ItemID,CustID WITH CUBE

Alternative syntax: "... GROUP BY CUBE(StoreID,ItemID,CustID)"

All result tuples, plus all cube summary tuples over result. For example:

(store123, NULL, cust456, $1000)
$1000 is sum of price for all items bought by Customer cust456 at store store123
Face of cube
(NULL, item789, NULL, $10,000)
$10,000 is sum of price for item item789 bought by any customer at any store
Edge of cube
(NULL, NULL, NULL, $1,000,000)
$1,000,000 is sum of price in entire database
Corner of cube

CUBE queries useful for data browsing, also for materialized views:

  CREATE MATERIALIZED VIEW SalesCube AS
    SELECT StoreID, ItemID, CustID, SUM(price) as p
    FROM Sales
    GROUP BY StoreID,ItemID,CustID WITH CUBE

Example query using view: Find total sales of all blue items in California

CUBE construct enables efficient data cube operations built on top of conventional relational DBMS.
Can write the same queries without CUBE. Often more complex, less efficient.

Note: We won't be covering the related construct WITH ROLLUP, which creates a subset of a cube and is used when dimension attributes form a hierarchy (e.g. storeID, regionID), so some of the aggregates aren't needed.

Data Mining

Search for patterns in large databases

Often performed over data warehouses
Both less structured and less ad-hoc than OLAP queries

Classic application: "market basket" data

  Purchase(salesID, item)
  ...
  (3, bread)
  (3, milk)
  (3, eggs)
  (3, beer)
  (4, beer)
  (4, chips)
  ....

Want to find association rules

  {L1,L2,...,Ln} -> R

"If a customer bought all the items in set {L1, L2, ..., Ln}, he is very likely to also have bought item R."

Example:

  {bread, milk} -> eggs
  {diapers}     -> beer

Question: Can we write a SQL query to find association rules?

Question: What can we write it in?

Two concepts for association rules: {L1,L2,...,Ln} -> R

Support: {L1,L2,...,Ln} must appear frequently
Support = #sales containing {L1,L2,...,Ln} / total #sales
Confidence: probability that R appears if {L1,L2,...,Ln} does
Confidence = #sales containing {L1,L2,...,Ln,R} / #sales containing {L1,L2,...,Ln}

Goals of data mining:

Quickly find association rules over extremely large data sets
(e.g., all Wal-Mart sales for a year).
Allow user to tune support and confidence.

Other types of data mining rules and patterns:

Classification trees (= decision trees)
Buyers(<attributes>, purchase)
Want to predict purchase from <attributes>
Clustering
Buyers(<attributes>)
Automatically group buyers into N similar types
Optimization
CreditCardCustomers(<attributes>)
Select card features to maximize revenue
Time-Series Analysis
StockPrice(date, price)
Predict future price of the stock