CS154 Project #1

Due Wednesday, February 2, 2000, 3:15PM

Overview of Course Project

The project is in three phases.

  1. A program that compiles regular expressions into transition tables for either a DFA or an NFA (this assignment). This program will be tested on some ``ugly'' RE's that produce very large DFA's, so you should make a choice for each RE. If you simply turn everything into an NFA, you will be too slow on simple searches once we get to part (3).

  2. A program that interprets the tables you created in part (1). You should be able to process very long strings of input.

  3. A demonstration of what you can do with your FA interpreter from part (2). We shall give you a large body of text (10Mb - 100Mb), perhaps the Stanford Web pages or a similar repository. You should ``data-mine'' something from the data by designing some regular expressions and searching for occurrences of substrings that match those expressions. Think, for example, about examining Web pages from all Stanford departments and finding course numbers and their instructors, although it will be up to you what to search for.

Programming Venue

Despite what you may have heard, NT is not taking over the world; Linux will dominate, and every computer scientist should assume that UNIX is and will remain the platform of choice. Thus, we assume you will write your code using one of the UNIX clusters, e.g., the ``elaines'' or ``epics.'' We also assume that you will write your code in C or C++; Java is acceptable, but you may lose some support that has been provided for C/C++ programming.

If you insist on programming in your dorm on a PC or MAC, then remember that the code has to run on a very large body of data in part (3). The files will appear in the class directory, and it is your responsibility to be able to access them.

Regular-Expression Parsing

Thanks to Nathan Folkert, there is a parser for regular expressions in subdirectory /usr/class/cs154/project. Check the file parser.h for a description of the parser and its input language. The file parser.c has the C code for the parser. You can just include it in your program (you don't even have to acknowledge this use; we'll assume it), or modify it should you like (again, no acknowledgement is necessary).

This code takes a regular expression of the type we expect you to process and produces a parse tree, i.e., a tree of structs, each of which represents a node of the tree; i.e., an operator or operand of the expression. This code has been compiled under gcc; we hope it will compile under other C compilers, but it has not be tested. You should report any problems to cs154-help@lists, and we shall try to adapt to what people are actually using, but there are no guarantees.

The Assignment

You should write a program that accepts a file containing a regular expression in the extended (UNIX-style) notation including simple character classes (bracketed lists of characters and ranges of characters like a-z). You should also allow the ? and positive closure (superscript +) operators. These extensions are described in Section 3.3.1 of the course reader.

Since we do not have the easy ability to distinguish superscripts from in-line characters, use | as the ``or'' operator and + as the superscript + (positive closure) operator. See the file project/parser.h for details, even if you do not use the code provided.

The output of your program should be a structure representing either a DFA or NFA for the same language as the input RE. You should design your structure carefully, because it will be used in part (2).

Some data-structure things to think about: Roughly, an automaton is a set of states. This set could be represented as an array or a list. A state is a set of input/next-state(s) pairs. Your automata will have a large input alphabet, perhaps all 128 ASCII characters, or the printable subset (i.e., those that appear on a typical keyboard). You might, for example, plan to ignore all characters that don't form part of the printable set of ASCII characters. How will you represent the transitions? By a list of symbol-state(s) pairs? By an array indexed by ASCII characters? Will you represent character classes specially, e.g., by handling the common situation where a state q goes to state p on a large number of different inputs such as all letters? We're not trying to tell you what should be done; we just want to warn you to think about these issues, since you will eventually have to simulate automata in whatever data structure you choose.

What to Hand In

  1. Your code. You may be able to submit this on-line. Details later.

  2. A writeup telling us:

    a)
    What strategy you used to decide whether to generate a DFA or NFA.

    b)
    The essential ideas behind your representation(s) of automata.

    c)
    Several examples of RE's that you compiled and the result. You may wish to develop a way to display the contents of your automaton data structures in a ``human-readable'' way.

    d)
    We shall also distribute several hard examples of RE's, and you should describe the results on these expressions: how long did it take; how big were your data structures?