WSQ: Web-Supported (Database) Queries
Roy Goldman, Jennifer Widom
WSQ (pronounced "wisk") is a new approach for combining the strengths
of existing Web search engines and RDBMS technology.
With WSQ, we can enhance SQL queries over a local relational database
with relevant searches over Google, AltaVista or any other search engine.
Our online demo allows users to ask a restricted yet interesting set
of WSQ queries. Users can rank the tuples in a local database by how
often they appear on the Web and, optionally, by how often they appear
along with arbitrary search terms. In this demo, all searches are
issued to AltaVista.
As a simple example, we can rank the ACM SIGs by how often they
appear on the Web near Knuth:
- Rank field: Select ACM SIGs.
- Near field: Type Knuth.
- Click Search the Web.
Now, for each ACM SIG, WSQ will issue a search to AltaVista to count
how often that SIG appears on the Web near Knuth. In the results, the
red number shown for each SIG is the total number of Web pages for
that SIG (as given by AltaVista). You can click on any SIG in the
results to see the actual URLs supplied by AltaVista for that SIG.
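The per-SIG counting described above can be sketched roughly as
follows. This is only an illustration, not the WSQ implementation: the
function names (rank_by_web_count, fake_count_hits) are hypothetical,
and the hit counts are made-up stand-ins for what a search engine
would return.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_web_count(
    identifiers: Iterable[str],
    near: str,
    count_hits: Callable[[str, str], int],
) -> List[Tuple[str, int]]:
    """Look up one hit count per tuple, then sort by descending count."""
    counts = [(ident, count_hits(ident, near)) for ident in identifiers]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up counts standing in for search engine responses (assumption).
FAKE_COUNTS = {"SIGACT": 120, "SIGMOD": 45, "SIGGRAPH": 80}


def fake_count_hits(identifier: str, near: str) -> int:
    """Hypothetical stub: a real version would query a search engine."""
    return FAKE_COUNTS[identifier]


ranking = rank_by_web_count(
    ["SIGMOD", "SIGACT", "SIGGRAPH"], "Knuth", fake_count_hits
)
for sig, hits in ranking:
    print(sig, hits)
```

Passing the counting step in as a function keeps the ranking logic
independent of any particular search engine.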
You can try the demo now, try out some sample queries, or read ahead
for more detailed instructions.
- Rank: Select one of several small local database tables in the Rank
field. Choices include U.S. states, European countries, and ACM
Special Interest Groups (SIGs). Click the Preview Local Database
button to examine the contents of a table (without yet consulting the
Web). The Identifier column holds the primary text string assumed to
identify the tuple on the Web. The optional Secondary Identifiers
column holds an additional disjunctive search expression that is also
useful for identifying the tuple. For example, among the Stanford
DBGroup members, "Jeff Ullman" is the primary identifier, and
"Jeffrey Ullman" or "Jeffrey D. Ullman" or "Jeff D. Ullman" is the
expression that constitutes the Secondary Identifiers.
- Near: Optionally specify in the Near field keywords to be searched
for along with each tuple in the local database. Suppose you select
the ACM SIGs as the local database. If you supply Knuth in the Near
field, you create a query that ranks the ACM SIGs by how often each
SIG appears on the Web near Knuth. If you leave the Near field empty,
you create a query that measures the pure popularity of each SIG on
the Web, independent of context. If the Near field is not empty, two
additional options are available:
  - Correlation: Correlation between each tuple and the Near
  expression can be tight or loose. Under tight correlation, the Near
  expression must appear on the Web in close textual proximity to the
  tuple identifier (implemented by using the AltaVista near operator
  in the search). Under loose correlation, we only require that the
  Near expression and the tuple identifier appear anywhere together on
  the same Web page (implemented by using the AltaVista and operator
  in the search).
  - Rankings: Rankings can be absolute or normalized. With absolute
  rankings, tuples are ranked simply by the number of times they
  appear on the Web together with the Near expression. With normalized
  rankings, the number of Web hits for the tuple together with the
  Near expression is normalized by the number of times the tuple
  appears on the Web without the Near expression. The motivation for
  this approach is best understood by considering the U.S. states.
  Ranking the states by their popularity on the Web (without a Near
  expression) shows that some states (such as California, Texas, and
  New York) appear far more often than others. Now suppose we want to
  rank each state by how often it appears on the Web near the keyword
  crime. With absolute rankings, the most popular states will rank
  highly again, simply because there are so many more Web pages for
  those states. But we may really be interested in the relative
  importance of crime to each state, that is, how often the word crime
  appears near a state relative to the total number of times the state
  is mentioned. Note that our normalization algorithm currently has
  its own limitations: it tends to quickly "disqualify" the most
  popular states from any search.
- Search the Web: Click Search the Web to issue your WSQ
query.
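The Correlation and Rankings options above can be illustrated with a
short sketch. It rests on several assumptions that do not come from
the WSQ system: the helper names (build_query, normalized_ranking) are
hypothetical, the query syntax (quoted phrases joined by OR, NEAR, and
AND) only approximates what a search engine might accept, and
normalization is taken to be a plain ratio of the two hit counts,
which may differ from the algorithm the demo actually uses. All hit
counts are invented.

```python
from typing import Dict, List, Sequence, Tuple


def build_query(
    identifier: str,
    secondary: Sequence[str] = (),
    near: str = "",
    tight: bool = True,
) -> str:
    """Build one search expression for a tuple (assumed syntax)."""
    names = ['"%s"' % name for name in (identifier, *secondary)]
    # Primary and secondary identifiers form one disjunction.
    tuple_expr = names[0] if len(names) == 1 else "(" + " OR ".join(names) + ")"
    if not near:
        return tuple_expr  # pure popularity, independent of context
    operator = "NEAR" if tight else "AND"  # tight vs. loose correlation
    return "%s %s %s" % (tuple_expr, operator, near)


def normalized_ranking(
    with_near: Dict[str, int], alone: Dict[str, int]
) -> List[Tuple[str, float]]:
    """Rank by hits-with-Near divided by hits-alone (assumed ratio)."""
    scores = [
        (ident, with_near[ident] / alone[ident])
        for ident in with_near
        if alone.get(ident, 0) > 0  # skip tuples with no baseline hits
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(build_query("Jeff Ullman", ["Jeffrey Ullman"], near="Knuth"))
print(build_query("SIGMOD", near="Knuth", tight=False))

# Invented counts: California has more hits overall, but a smaller
# share of its pages mention the Near keyword.
state_scores = normalized_ranking(
    with_near={"California": 500, "Wyoming": 60},
    alone={"California": 10000, "Wyoming": 600},
)
print(state_scores)
```

With these invented numbers, California leads an absolute ranking (500
hits versus 60), while the ratio places Wyoming first, which is the
effect the normalized option is meant to expose.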
WSQ is described in more detail in "WSQ/DSQ: A Practical Approach for
Combined Querying of Databases and the Web" (Postscript) (Acrobat).
This paper will appear in Proceedings of the ACM SIGMOD International
Conference on Management of Data in May 2000. Other questions or
comments? Please contact Roy Goldman, royg@cs.stanford.edu.