My imitation of the POS tagger at www.nlpwp.org was finished today, and everything seems to work. Seems. The problem is that the nlpwp tagger was written for the sake of instruction and is not meant to be a working application. The frequency-based tagging is evaluated there, but the performance of the transformation-based tagging remains unclear, especially how much accuracy changes from one correction rule to the next.

The plan is to write my own flavour of the nlpwp tagger with some small differences. I’m still just brainstorming, so bear with me:

  • use three different versions of the Brown Corpus
    1. a tagged file to build a model (trainFile)
    2. an untagged file to test the model upon (testFile)
    3. a tagged file to evaluate the tags generated using the model on the testFile (evalFile). This is the same as the testFile, except that it’s been tagged
  • maintain a clear difference between training, testing and evaluating the tagger in the Haskell code
  • try to use a data type consistently for the list of tokens/tags (instead of a list, Zipper, Map…)
  • check whether the tagger can be used on several corpora
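
To make the split between training, testing and evaluating concrete, here is a minimal sketch of how the three phases could look as separate functions. It only covers the frequency-based part, and it assumes the Brown-style `word/TAG` token format. All names (`parseTagged`, `Model`, `train`, `tagWith`, `accuracy`) are my own placeholders, not taken from the nlpwp code.

```haskell
import           Data.List       (foldl', maximumBy)
import qualified Data.Map.Strict as M
import           Data.Ord        (comparing)

type Token = String
type Tag   = String

-- Hypothetical model type: each known token maps to its most frequent tag.
newtype Model = Model (M.Map Token Tag)

-- Split "the/AT" into ("the", "AT"), cutting at the last slash.
-- A token without a slash gets an empty tag.
parseTagged :: String -> (Token, Tag)
parseTagged s = case break (== '/') (reverse s) of
  (revTag, _ : revTok) -> (reverse revTok, reverse revTag)
  (revTok, [])         -> (reverse revTok, "")

-- Training: count how often each tag occurs per token (trainFile),
-- then keep only the most frequent tag.
train :: [(Token, Tag)] -> Model
train pairs = Model (M.map mostFrequent counts)
  where
    counts :: M.Map Token (M.Map Tag Int)
    counts = foldl' addPair M.empty pairs
    addPair m (tok, tag) =
      M.insertWith (M.unionWith (+)) tok (M.singleton tag 1) m
    mostFrequent = fst . maximumBy (comparing snd) . M.toList

-- Testing: tag untagged tokens (testFile); unknown tokens get Nothing.
tagWith :: Model -> [Token] -> [Maybe Tag]
tagWith (Model m) = map (`M.lookup` m)

-- Evaluating: fraction of predicted tags that match the gold tags (evalFile).
accuracy :: [Maybe Tag] -> [Tag] -> Double
accuracy predicted gold
  | null gold = 0
  | otherwise = fromIntegral correct / fromIntegral (length gold)
  where
    correct = length (filter id (zipWith (\p g -> p == Just g) predicted gold))
```

In GHCi this could be used as `accuracy (tagWith (train trainPairs) testTokens) goldTags`, where `trainPairs` comes from `map parseTagged . words` on the trainFile. Wrapping the map in a `newtype` is one possible way to keep a single data type at the phase boundaries; whether that holds up for the transformation-based rules is still open.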

I’m not saying that my way will be better, but I think I have to try this to get a better understanding of how the tagger works and what its weak spots are. To be continued…