============================================================================ CS 346 Fall '98 - Prof. Widom - Handout #9 LECTURE NOTES - Query processing lecture #1 ============================================================================ Steps in database query processing ---------------------------------- Query string --> Parser Query tree --> Checker Valid query tree --> View expander Valid tree w/o views --> Logical query plan generator Logical query plan --> Query rewriter (heuristic) Better logical plan --> Physical query plan generator (cost-based) Selected physical plan --> Code generator Executable code --> Execution engine Running example --------------- Student(ID, Name, Major) // ID is key Course(Num, Dept) // Num is key Taking(ID, Num) // (ID,Num) is key Query to find all EE students taking at least one CS course: SELECT DISTINCT Name FROM Student, Course, Taking WHERE Taking.ID = Student.ID AND Taking.Num = Course.Num AND Major = 'EE' AND Dept = 'CS' Parser ------ Produces tree representation of the query string: Checker ------- Verifies query tree against database schema: - All tables in FROM clause exist - All columns of tables exist - No ambiguities in table references or unqualified attribute references (table names usually added at this point) - All comparisons, aggregations, etc. are type-compatible View expander ------------- Suppose Student is actually a view: StudName(ID, Name) StudMajor(ID, Major) CREATE VIEW Student AS SELECT StudName.ID, StudName.Name, StudMajor.Major FROM StudName, StudMajor WHERE StudName.ID = StudMajor.ID Original query becomes: Or "flattened": Logical and physical query plans -------------------------------- - Both are trees representing query evaluation - Leaves of the tree represent data - Internal nodes of the tree are "operators" over the data - Logical plan is higher-level and algebraic - Physical plan is lower-level and operational - Logical plan operators correspond to query language constructs - Physical plan operator correspond to implemented "access methods" Logical plan operators ---------------------- - Extended relational algebra - Leaves of logical plans are table names - Basic operators: SELECT, PROJECT, CROSS-PRODUCT, UNION, DIFFERENCE - Abbreviations: NATURAL-JOIN, THETA-JOIN, INTERSECT - Extensions: RENAME, AGGREGATE/GROUP-BY, DISTINCT (+ others) Logical query plan for running example query (3-way cross-product): Using 2-way joins instead of 3-way cross-product: Logical plan generation ----------------------- Usually there's a straightforward mapping from a valid parse tree to a "naive" logical query plan. The logical plan may then be rewritten to a "better" one. Push selections down in previous example: Physical plan operators ----------------------- - Leaves of physical plans are usually table names, sometimes indexes or other information. - Access methods for single tables: * TABLE SCAN * INDEX SCAN W/O CONDITION * CONDITION-BASED INDEX SCAN - Join methods: * NESTED-LOOP JOIN (various algorithms/improvements) * SORT-MERGE JOIN * HASH JOIN (various algorithms) - Other important operators: * SORT * GROUP * AGGREGATE (may be combined with GROUP) * UNION, EXCEPT, INTERSECT - Operators often merged with other ones: * SELECTION (APPLY-PRED) * PROJECTION * DISTINCT - In a distributed system: * SHIP - In a parallel system: * EXCHANGE Possible physical plan for running example query: Another possible physical plan: Physical plan generation ------------------------ There are lots and lots of possible physical query plans for a given logical plan. The physical plan generator (sometimes called "physical plan enumerator") tries to select the one that will be "optimal", usually with respect to response time or in some cases throughput. => This is the meatiest part of query processing!!! Code generator -------------- Translates physical query plan tree into executable code. Very system-specific -- some systems may instead use a query plan interpreter. Some terminology ---------------- "Query processing" = entire process "Query optimization" = rewriter + physical plan generator "Query compilation" = entire process through code generation "Query execution" = execution engine Coming up --------- - Query rewriting - Passing information between physical operators - Cost metrics for physical query plans - Physical query plan enumeration - Reducing the search space - Further optimizations Exercise -------- Consider the following modification of our running example: Student(ID, Name, Major) // ID is key Course(Num, Dept) // Num is key Took(ID, Num, Quarter) // (ID,Num) is key Query to find all EE students who took at least one CS course in the fall quarter 1998: SELECT * FROM Student, Course, Took WHERE Took.ID = Student.ID AND Took.Num = Course.Num AND Student.Major = 'EE' AND Course.Dept = 'CS' AND Took.Quarter = 'Fall98' Suppose: - The DBMS supports NESTED-LOOP JOIN, SORT-MERGE JOIN, and HASH JOIN - Each table can be accessed either with a TABLE SCAN or a CONDITION-BASED INDEX SCAN (assume all relevant attributes are indexed) - For simplicity assume initially that the system will not perform a NESTED-LOOP JOIN using an index on the inner relation's join attribute Note: join associativity matters Q: How many possible physical query plans are there for this query? A: Q: What if the system also supports NLJ with inner index? Case (a): System can exploit 2 indexes on a table Case (b): System can't exploit 2 indexes on a table