It works

Over the last few weeks, I have been finishing the first version of the posTagger, and I’m thrilled to announce that it works. The posTagger currently:

  • Analyses a training file and builds a list with tokens and their most frequent tag.
  • Tags the tokens in the patching file. Unknown tokens are tagged as nouns (“NN”).
  • Builds a list of possible transformation rules to correct tags, based on the patching file.
  • Finds the most effective transformation rule for a given state, changes the state of the patching file using that rule and calculates the percentage of correct tags.
  • Repeats this last step until no more improvement is possible.
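As a rough illustration of the first two steps, here is a minimal Python sketch of the frequency-based baseline. It is not the posTagger's actual code: the function names (`build_lexicon`, `baseline_tag`, `accuracy`) and the representation of the training file as a list of (token, tag) pairs are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_lexicon(training):
    """Map each token to its most frequent tag.

    `training` is assumed to be a list of (token, tag) pairs.
    """
    counts = defaultdict(Counter)
    for token, tag in training:
        counts[token][tag] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}

def baseline_tag(tokens, lexicon, default="NN"):
    """Tag each token with its most frequent tag; unknown tokens become nouns."""
    return [lexicon.get(token, default) for token in tokens]

def accuracy(predicted, gold):
    """Percentage of positions where the predicted tag equals the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

training = [("the", "DT"), ("dog", "NN"), ("run", "VB"), ("run", "NN"), ("run", "VB")]
lexicon = build_lexicon(training)
tags = baseline_tag(["the", "cat", "run"], lexicon)  # "cat" is unknown
```

The same `accuracy` idea is what the percentages in this post express: correctly tagged tokens divided by all tokens in the patching file.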

There are 14836 rules in the transformation list. Finding the most effective transformation rule takes quite a while at every state, so a full run of the application would take days or weeks. To counter this, I isolated the 100 and the 1000 most frequent transformation rules, respectively. The tagger then calculates the most effective rule in that shorter list and changes state. This yields the following results (expressed as the percentage of correctly tagged tokens in the patching file):
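The shortlist idea combined with the greedy loop could look roughly like the Python sketch below. Several things here are assumptions for the sake of the example: a rule is a hypothetical (from_tag, to_tag, prev_tag) triple covering only the 'tag before' template, "most frequent" is read as "proposed by the most errors", and the shortlist is built once before the loop starts.

```python
from collections import Counter

# Hypothetical rule format: (from_tag, to_tag, prev_tag), meaning
# "change from_tag to to_tag when the previous tag is prev_tag".

def candidate_rules(current, gold):
    """Count one candidate rule per wrongly tagged position."""
    counts = Counter()
    for i in range(1, len(current)):
        if current[i] != gold[i]:
            counts[(current[i], gold[i], current[i - 1])] += 1
    return counts

def apply_rule(rule, tags):
    """Return a new tag list; contexts are read from the old list."""
    from_tag, to_tag, prev_tag = rule
    return [
        to_tag if i > 0 and tag == from_tag and tags[i - 1] == prev_tag else tag
        for i, tag in enumerate(tags)
    ]

def count_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))

def train(current, gold, top_n=100):
    """Greedy search restricted to the top_n most frequent candidate rules."""
    shortlist = [rule for rule, _ in candidate_rules(current, gold).most_common(top_n)]
    learned = []
    while True:
        best_rule, best_tags, best_score = None, current, count_correct(current, gold)
        for rule in shortlist:
            new_tags = apply_rule(rule, current)
            score = count_correct(new_tags, gold)
            if score > best_score:
                best_rule, best_tags, best_score = rule, new_tags, score
        if best_rule is None:  # no rule improves the current state
            return learned, current
        learned.append(best_rule)
        current = best_tags

start = ["DT", "NN", "TO", "NN", "DT", "NN"]
gold = ["DT", "NN", "TO", "VB", "DT", "NN"]
learned, final = train(start, gold, top_n=100)
```

Each pass through the loop scores every rule in the shortlist against the whole file, which is why the cost grows with the shortlist size: 1000 rules means roughly ten times the work per state compared with 100.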

  • basic tagging with the frequency list: 87.66183 %
  • tagging unknown tokens as nouns: 88.5266 %
  • using the 100 most frequent transformation rules: 90.993805 % after 30 rules
  • using the 1000 most frequent transformation rules: 92.19936 % after 106 rules.

I stopped running the posTagger at state 106 because it took way too long, so there is still room for improvement there. The basic functionality of the tagger seems to work, and in the coming weeks I will try to find solutions to the following issues:

  1. What is an efficient number of transformation rules to start off with? A list of 100 rules runs fast but results in fewer corrections, while a list of 1000 rules runs slowly but returns more corrections. What is the ideal number here? I need a better view of the results in each state to determine this.
  2. Are some rules applied twice? If not, maybe we can delete each rule we used from the list and let the tagger run faster as it progresses.
  3. I only implemented three patch templates from Brill’s original paper (‘tag before’, ‘tag after’, ‘tag before and tag after’). I should implement all of them.
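To make the three templates from item 3 concrete, here is one way they could be represented in Python. This is a sketch under assumptions, not the posTagger's implementation: the `Rule` class, its field names and the `rules_for_error` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Rule:
    """Change from_tag to to_tag when the context tags match (hypothetical format)."""
    from_tag: str
    to_tag: str
    prev_tag: Optional[str] = None  # None means "any tag before"
    next_tag: Optional[str] = None  # None means "any tag after"

    def matches(self, tags: List[str], i: int) -> bool:
        if tags[i] != self.from_tag:
            return False
        if self.prev_tag is not None and (i == 0 or tags[i - 1] != self.prev_tag):
            return False
        if self.next_tag is not None and (i == len(tags) - 1 or tags[i + 1] != self.next_tag):
            return False
        return True

def rules_for_error(tags: List[str], gold: List[str], i: int) -> List[Rule]:
    """Instantiate the three templates at a wrongly tagged position i."""
    prev_tag = tags[i - 1] if i > 0 else None
    next_tag = tags[i + 1] if i < len(tags) - 1 else None
    rules = []
    if prev_tag is not None:  # 'tag before'
        rules.append(Rule(tags[i], gold[i], prev_tag=prev_tag))
    if next_tag is not None:  # 'tag after'
        rules.append(Rule(tags[i], gold[i], next_tag=next_tag))
    if prev_tag is not None and next_tag is not None:  # 'tag before and tag after'
        rules.append(Rule(tags[i], gold[i], prev_tag=prev_tag, next_tag=next_tag))
    return rules

tags = ["TO", "NN", "DT"]
gold = ["TO", "VB", "DT"]
rules = rules_for_error(tags, gold, 1)
```

Because the class is frozen and therefore hashable, rules can be counted in a dictionary or removed from a set once used, which would be one way to experiment with the question in item 2.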

One More Thing

Oh… Here’s Jigsaw Falling Into Place (Thumbs Down Version).
I started on my new job a few days ago. From Steve Jobs’ Stanford commencement speech:

“Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle. As with all matters of the heart, you’ll know when you find it. And, like any great relationship, it just gets better and better as the years roll on. So keep looking until you find it. Don’t settle.”