I haven’t been posting a lot lately, but I have been working on the POS tagger. Trust me :-). Over the last few weeks, I have focused on rewriting the tagger built by Daniël de Kok and Harm Brouwer (www.nlpwp.org, based on an article by Eric Brill). I tried to understand the functions and capture them in my own words, and in the process I checked my results against theirs, such as the percentage of correct/empty/incorrect tags.

All results were identical until today. There seems to be a slight problem with the nlpwp tagger. Daniël de Kok and Harm Brouwer roughly do the following:

  1. Create a model of the most frequent word/tag pairs from a tagged training file.
  2. Run the model on an untagged patch file.
  3. Tag all unknown words in the patch file with “NN”, then check the result against the tagged version of the same file (a sketch of steps 1 and 3 follows below).
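In Haskell terms, steps 1 and 3 amount to something like the following. This is only a minimal sketch with hypothetical names (`trainFreqs`, `mostFrequent`, `tagWord`), not the actual nlpwp code:

```haskell
import qualified Data.Map.Strict as M
import           Data.List (maximumBy)
import           Data.Ord (comparing)

type Token = String
type Tag   = String

-- Step 1: count how often each tag occurs for each word
-- in the tagged training data.
trainFreqs :: [(Token, Tag)] -> M.Map Token (M.Map Tag Int)
trainFreqs = foldr addPair M.empty
  where
    addPair (w, t) = M.insertWith (M.unionWith (+)) w (M.singleton t 1)

-- Reduce the counts to the single most frequent tag per word.
mostFrequent :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
mostFrequent = M.map (fst . maximumBy (comparing snd) . M.toList)

-- Step 3: look a word up in the model, falling back to "NN"
-- when it is unknown.
tagWord :: M.Map Token Tag -> Token -> Tag
tagWord model w = M.findWithDefault "NN" w model
```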

The logical next step would be to extract transformational rules from the patch file and use them to “patch” the model, as is done in the original article. Instead, they seem to run the model against the training file itself, tag unknown words with “NN”, and look for transformation rules there. This seems to be a bug, and it is odd for two reasons.
Firstly, there is no reason to run the model against the data it was trained on. The model contains every word in that data, because it was built from that very data: you’re moving in a circle. As a consequence, there is no need to tag unknown words with “NN”, because there are no unknown words (cf. nlpwp.org itself, under 7.3 Evaluation).
Secondly, transformation rules extracted from this file can never cover the case of unknown words being tagged with “NN”. I suspect the resulting rules will not work that well, though this still has to be tested.

Instead, I think one should extract transformational rules from the patch file, keep the most effective ones, and then test the whole tagger on a third, untagged file. I expect the results of that test to give a good idea of the tagger’s strength. In fact, this is exactly what Eric Brill, the author of the original article, did when he created the Brill tagger.
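To make that concrete, a Brill-style transformation rule can be represented and scored roughly like this. Again a sketch: I show only one made-up rule template (Brill’s article defines a whole family), and all names here are my own:

```haskell
type Tag = String

-- One Brill-style rule template: change fromTag into toTag
-- when the preceding word carries prevTag.
data Rule = Rule { fromTag :: Tag, toTag :: Tag, prevTag :: Tag }

-- Apply a rule to a sequence of proposed tags.
applyRule :: Rule -> [Tag] -> [Tag]
applyRule r tags = zipWith fix ("<start>" : tags) tags
  where
    fix prev t
      | t == fromTag r && prev == prevTag r = toTag r
      | otherwise                           = t

-- Score a rule on the patch data: tags it fixes minus tags it
-- breaks. The "most effective" rules are those with the highest
-- score.
scoreRule :: Rule -> [Tag] -> [Tag] -> Int
scoreRule r proposed gold =
  sum (zipWith delta (applyRule r proposed) (zip proposed gold))
  where
    delta new (old, g)
      | new == g && old /= g =  1   -- corrected a wrong tag
      | new /= g && old == g = -1   -- broke a previously correct tag
      | otherwise            =  0
```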

My Haskell version of the tagger will use three files (as described in the original article): a training file, a patch file and a test file, containing 90 %, 5 % and 5 % of the Brown Corpus, respectively.
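Assuming the corpus is simply a list of tagged sentences, the split itself is straightforward. `splitCorpus` is a name I made up; any deterministic 90/5/5 split will do:

```haskell
-- Split a corpus into training (90%), patch (5%) and test parts;
-- the test part takes whatever remains (roughly 5%).
splitCorpus :: [a] -> ([a], [a], [a])
splitCorpus xs = (train, patch, test)
  where
    n             = length xs
    (train, rest) = splitAt (n * 90 `div` 100) xs
    (patch, test) = splitAt (n * 5 `div` 100) rest
```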

I’ll keep you posted.