Do the backtrack

Whenever exploring the Canadian wild, you should sing to scare off bears, wolves, cougars and squirrels. You should also backtrack now and again, just to make sure that you are where you think you are.

Brill’s tagger is not exactly a prime example of Canadian scenery and I do not sing while programming, but I did backtrack the trail of the last few weeks. I wrote functions to export two CSV files:

  1. the list of tokens and their most frequent tags, distilled from the training file (view);
  2. the list of transformation rules and their frequencies (view).

I just thought that they might come in handy as landmarks when debugging or exploring the horizon of natural language processing.

Event Horizon

We now have a list of all transformation rules, so let’s transform like crazy! Where is Optimus Prime?

Well… It’s not quite that simple. Applying a transformation rule corrects some errors in the proposed tags, but it also introduces new ones. The most successful rule is therefore the one with the greatest net improvement: the number of errors it corrects minus the number of errors it causes.
After finding the most effective rule, we apply it (“change state”) and look for the best rule again. After which we change state, look for the best rule for that state, change state and… You get the idea.
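The greedy loop above can be sketched in a few lines of Python. This is only an illustration, not the blog’s actual code: rules are modelled as plain functions from a tag sequence to a new tag sequence, and the names (`net_improvement`, `train`) are made up for the example.

```python
def net_improvement(rule, proposed, gold):
    """Errors the rule corrects minus errors it causes."""
    after = rule(proposed)
    fixed = sum(1 for p, a, g in zip(proposed, after, gold)
                if p != g and a == g)   # was wrong, now right
    caused = sum(1 for p, a, g in zip(proposed, after, gold)
                 if p == g and a != g)  # was right, now wrong
    return fixed - caused

def train(rules, proposed, gold):
    """Greedy loop: pick the best rule, apply it ("change state"), repeat
    until no rule yields a net improvement."""
    chosen = []
    while True:
        best = max(rules, key=lambda r: net_improvement(r, proposed, gold))
        if net_improvement(best, proposed, gold) <= 0:
            break                       # no net improvement left: stop
        proposed = best(proposed)       # change state
        chosen.append(best)
    return chosen, proposed
```

Note the stopping condition: as soon as the best remaining rule no longer improves the tagging on balance, the loop ends, which is exactly the “always check whether there is still a net improvement” safeguard mentioned below.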

This procedure looks simple, but with 14,836 rules it had better be fast. Like Road Runner fast. And we should always check whether there is still a net improvement. We could limit the set to the 100 most common rules, but this blog is also a zen exercise in dealing with huge amounts of data. Sticking to the 100 most common rules is for cowards. For the time being :-)

So let’s face the horizon and just start walking, shall we?