Project #2 FAQ


Question: Where would you like this project to go? What suggestions do you have?
Answer: Here are some Project Hints to get you thinking about the kinds of things you might do with this project.
Question: When I search for ``School of Business'' I first match ``School of B''.
Answer: That was a mistake in the specification. We really should have asked for a pattern that ends with whitespace, e.g., a blank or newline. The parser.h file describes the whitespace codes allowed in regular expressions given to Nathan's parser, including {s} for any one whitespace character. Added later: It was pointed out that we then miss occurrences that fall at the end of a sentence or are followed by a comma or other punctuation. Thus, you may wish to end with an expression like ({s}|[\.\,;]). Remember that a literal period is written \. and a literal comma \,. Added much later: It was also pointed out that a tag is a logical ender for the expression we are looking for. Thus, including < as a possible ender character makes sense too.
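For anyone who wants to try the ender idea outside the project code, here is a minimal sketch using Python's re module. It is only an analog: \s stands in for {s}, the names ENDER, PATTERN, and find_phrase are made up for illustration, and none of this is the syntax accepted by Nathan's parser.

```python
import re

# Python analog of ending the pattern with ({s}|[\.\,;]) plus "<".
# \s plays the role of {s}; this is Python syntax, not parser.h syntax.
ENDER = r"(\s|[.,;<])"
PATTERN = re.compile(r"School of Business" + ENDER)

def find_phrase(text):
    """Return the matched phrase without its one-character ender, or None."""
    m = PATTERN.search(text)
    # The ender always consumes exactly one character, so drop it.
    return m.group(0)[:-1] if m else None
```

With this pattern, "School of Business." and "School of Business</b>" both match, while "School of Businesses" does not, because no ender character follows the phrase.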
Question: There are ctrl-M's in the Web text that make it hard to read, and that the parser fails to recognize as newlines.
Answer: Apparently some of the Stanford Web was created using Windows (shame on them), and Microsoft, in one of its early attempts at incompatibility with UNIX, used ctrl-M (carriage return) rather than the ASCII newline control character ('\n' in C) to separate lines. One of the students, B.C. Wong, suggested the following UNIX ``translate'' command:

     tr '\r' '\n' <inputFile >outputFile

to replace the carriage returns ('\r') by newlines. I actually did that for the truncated files, which are now available through the Web as x1000.txt and so on. I was not able to do it for the entire 100MB file, because we are not allocated enough space. However, if you feel the ctrl-M's are giving you trouble, you can write project #2 to take its data from standard input, and use the UNIX ``pipe'' symbol | to have the translation done piecemeal as your program reads its input. The idea is

     tr '\r' '\n' </usr/class/cs154/WWW/w.txt | yourProj2...
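The receiving end of that pipe might look roughly like the sketch below, written in Python purely for illustration. It reads standard input in fixed-size chunks, so the whole file is never stored, and it also repeats the carriage-return translation itself in case tr is left out of the pipeline. The names read_chunks, cr_to_lf, and main, and the 64K chunk size, are assumptions and not part of the project code.

```python
import sys

def read_chunks(stream, size=65536):
    """Yield successive blocks of bytes from stream until end of input."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk

def cr_to_lf(chunks):
    """Yield each chunk with every carriage return replaced by a newline."""
    for chunk in chunks:
        yield chunk.replace(b"\r", b"\n")

def main():
    # Work on raw bytes so Python does no newline translation of its own.
    for chunk in cr_to_lf(read_chunks(sys.stdin.buffer)):
        # A real project would search the chunk here; this sketch echoes it.
        sys.stdout.buffer.write(chunk)
```

Calling main() at the bottom of a script lets it sit on the right-hand side of the pipe. Like the tr command, this turns a carriage return followed by a newline into two newlines.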