In traditional database management systems (DBMS), the schema, which is a description of the data stored by the DBMS, must be defined in advance. All data must be well structured and conform to this predefined schema.
Lore (Lightweight Object Repository) is a new database system designed to support storage and queries of semistructured data. In Lore, data is self-describing, so it does not need to adhere to a schema fixed in advance. Lore is particularly well-suited for document data, including HTML documents available on the World Wide Web.
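To illustrate what "self-describing" means here, the following minimal sketch (record contents and field names are hypothetical, not taken from Lore itself) shows two records about the same kind of entity that carry their own structure and need not agree on a schema:

```python
# Illustrative only: two self-describing records with different fields
# and different nesting -- no schema fixed in advance is required.
doc_a = {
    "type": "person",
    "name": "Alice",
    "homepage": "http://example.org/~alice",
}
doc_b = {
    "type": "person",
    "name": {"first": "Bob", "last": "Lee"},  # nested where doc_a was flat
    "papers": ["Demo notes", "Query tips"],   # field absent from doc_a
}

def fields(doc):
    """Return the top-level field names a record describes itself with."""
    return sorted(doc)

print(fields(doc_a))  # ['homepage', 'name', 'type']
print(fields(doc_b))  # ['name', 'papers', 'type']
```

A traditional relational DBMS would reject one of these records or force NULL-padded columns; a semistructured store accepts both as-is.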
The demo will showcase the Lore system. We will introduce Lorel, the Lore query language, and contrast the system with object-oriented and relational DBMSs.
The advent of the World Wide Web on the Internet allows for efficient sharing of medical information. However, with this new ease of communicating information, new issues of security and privacy of the data arise. To address both the need for efficient data sharing and the requirement for respect of privacy, we have developed and implemented the TIHI system, a rule-based, human-monitored software entity that helps medical institutions share data with legitimate partners via the World Wide Web. Customers outside a medical institution can submit queries to the Security Mediator, the software core of the system, which not only performs access control but also subjects the results to content validation. In our model, the Security Mediator receives instructions from, and remains under the supervision of, the Security Officer, a person ultimately responsible for enforcing the hospital or clinic policy for data security within the medical institution.
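The two-stage mediation described above can be sketched as follows. This is a hypothetical illustration, not TIHI's actual rule language or interface: the field names and the single policy rule are invented for the example.

```python
# Hypothetical sketch of a rule-based security mediator: check the query
# before execution (access control), then check the result before release
# (content validation). All names and rules here are illustrative.
FORBIDDEN_FIELDS = {"patient_name", "ssn"}  # assumed policy set by the
                                            # Security Officer

def check_query(requested_fields):
    """Access control: reject queries that request any protected field."""
    return not (set(requested_fields) & FORBIDDEN_FIELDS)

def validate_result(rows):
    """Content validation: release results only if no protected field
    slipped through (e.g. via a join or a wildcard)."""
    return all(FORBIDDEN_FIELDS.isdisjoint(row) for row in rows)

print(check_query(["diagnosis", "age"]))  # True: query may proceed
print(check_query(["ssn", "diagnosis"]))  # False: blocked at the gate
print(validate_result([{"diagnosis": "flu", "age": 42}]))  # True: released
```

The point of the second check is that access control alone is not enough: results are inspected again before they leave the institution, which is why TIHI pairs both stages under one mediator.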
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high-quality human-maintained indexes such as "Yahoo!" or with search engines. Human-maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low-quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead or "spam" automated search engines.
The citation graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results.
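The rapid calculation mentioned above is commonly done by power iteration over the link graph. The following is a minimal sketch, not the production implementation: the damping factor of 0.85 and the handling of dangling pages are standard published choices, and the tiny example graph is invented.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a link graph {page: [linked pages]}.
    d is the damping factor (0.85 is the commonly published value).
    Each page's rank is split evenly among the pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += d * pr[p] / n
            else:
                for q in outs:
                    new[q] += d * pr[p] / len(outs)
        pr = new
    return pr

# Tiny invented graph: three pages all cite "home", so "home" should
# come out as the most important page.
graph = {"home": ["a"], "a": ["home"], "b": ["home"], "c": ["home"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # home
```

Sorting keyword-match results by this score is exactly the prioritization described above: pages with many (or important) citers float to the top regardless of how often the query terms appear on them.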
Although PageRank draws on the academic citation literature, it depends on properties of the web that are not present in typical academic citations. In addition, we will discuss how PageRank measures how often each web page is visited according to an idealized model of user behavior. We will demonstrate the difficulty of artificially inflating a page's PageRank as advertisers may attempt to do. We will describe expected discrepancies between a web page's actual usage count and its PageRank.
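The idealized model of user behavior referred to above is usually described as a "random surfer": a user who mostly follows random links but occasionally jumps to a random page. Under that assumption (the step count and damping value below are illustrative), a simple simulation estimates how often each page is visited:

```python
import random

def random_surfer(links, d=0.85, steps=200_000, seed=0):
    """Monte Carlo estimate of visit frequency under the random-surfer
    model: with probability d follow a random outlink of the current
    page, otherwise jump to a uniformly random page."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        outs = links[page]
        if outs and rng.random() < d:
            page = rng.choice(outs)      # follow a link
        else:
            page = rng.choice(pages)     # jump anywhere
    return {p: v / steps for p, v in visits.items()}

# Same invented graph as before: three pages all cite "home".
graph = {"home": ["a"], "a": ["home"], "b": ["home"], "c": ["home"]}
freq = random_surfer(graph)
print(max(freq, key=freq.get))  # home
```

The simulated visit frequencies converge to the same stationary distribution that power iteration computes, which is why PageRank can be read as an idealized usage count; discrepancies against real usage logs arise precisely where real users do not behave like this model.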
A prototype that contains the PageRank for 16 million pages and searches over web page titles will be demonstrated. The attendees will be asked to offer queries to help us demonstrate the system and assess the quality of the search results. We will also demonstrate a web browsing accessory that graphically annotates each link in the current page with its PageRank, allowing users to visually spot the destinations with the highest citation importance.