Brown Corpus

Backtrack to the horizon, you Road Runner!

Do the backtrack. Whenever exploring the Canadian wild, you should sing to scare off bears, wolves, cougars, and squirrels. You should also backtrack now and again, just to make sure that you are where you think you are. Brill’s tagger is not exactly a prime example of Canadian scenery, and I do not sing while...

Va, vis, deviens

Instead of writing code like crazy, I’m reading documentation on Brill’s tagger. The first interesting text is Eric Brill’s dissertation A Corpus-Based Approach to Language Learning, where he describes a method for transformation-based tagging. It seems that a few of my questions are being answered in his thesis, so I will work my way through it...

When God Created the Coffee Break

I haven’t been posting a lot lately, but I have been working on the POS tagger. Trust me :-). Over the last few weeks, I have focused on rewriting the tagger built by Daniël de Kok and Harm Brouwer (www.nlpwp.org, based on an article by Eric Brill). I tried to understand the functions and capture them...

Travel is Dangerous

I haven’t posted anything for a while, but I have been programming and rewriting parts of the posTagger from www.nlpwp.org: the tagger now works with three different files (training, testing, and evaluation); it is trained on the frequency of tags and assigns the most common tag (“NN”). The next step is the implementation of transformational rules,...
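The default-tag backoff mentioned in this excerpt can be sketched as follows; `backoffTag` and `tagSentence` are my own names for a minimal version, not necessarily the code from the post:

```haskell
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

type Token = String
type Tag   = String

-- Look the token up in the trained model; unknown tokens fall
-- back to "NN" (the most common tag, per the post).
backoffTag :: M.Map Token Tag -> Token -> Tag
backoffTag model tok = fromMaybe "NN" (M.lookup tok model)

-- Tag a whole sentence of tokens with the trained model.
tagSentence :: M.Map Token Tag -> [Token] -> [(Token, Tag)]
tagSentence model = map (\tok -> (tok, backoffTag model tok))
```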

The Plan

My imitation of the POS-tagger at www.nlpwp.org was finished today and everything seems to work. Seems. The problem is that the nlpwp tagger was written for the sake of instruction and is not supposed to be a working application. The success of frequency-based tagging is evaluated, but the performance of the transformation-based tagging remains unclear,...

Taggart

In this post, we will: present the last two functions for our tagger, wrap the functions in some IO paper, and go for a test drive. Finishing touches: the first function is simply a composition of tokenMostFreqTag and tokenTagFreqs to make our life a bit easier: trainFreqTagger :: [TokenTag] -> M.Map Token Tag trainFreqTagger =...
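Given the type signature in the excerpt, the training function is just the composition of the two helpers; the helper bodies below are my own reconstruction under those names, not necessarily the code at nlpwp.org:

```haskell
import qualified Data.Map as M

type Token    = String
type Tag      = String
type TokenTag = (Token, Tag)

-- Frequencies of each tag per token in the training data.
tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
tokenTagFreqs =
  foldr (\(tok, tag) -> M.insertWith (M.unionWith (+)) tok (M.singleton tag 1)) M.empty

-- Keep only the most frequent tag for each token.
tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
tokenMostFreqTag =
  M.map (fst . M.foldrWithKey (\tag n acc -> if n > snd acc then (tag, n) else acc) ("", 0))

-- The composition quoted in the excerpt.
trainFreqTagger :: [TokenTag] -> M.Map Token Tag
trainFreqTagger = tokenMostFreqTag . tokenTagFreqs
```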

I am here to accumulate you

In the last post, we were building a POS (part-of-speech) tagger based on the work at www.nlpwp.org. We were already able to distil from a corpus a Map containing its tokens, their tags, and their frequencies. Next, we need a function to fold through the Map and produce the tag with the highest...
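A fold along those lines might look like this — a sketch with my own names, where the accumulator carries the best tag seen so far:

```haskell
import qualified Data.Map as M

-- Fold through a token's tag-frequency Map and keep the tag
-- with the highest count; Nothing for an empty Map.
highestFreqTag :: M.Map String Int -> Maybe String
highestFreqTag m = fst <$> M.foldrWithKey keepMax Nothing m
  where
    keepMax _ n best@(Just (_, bestN))
      | bestN >= n = best            -- current best still wins
    keepMax tag n _ = Just (tag, n)  -- new best (or first) tag
```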

Cliffhanger

The seventh chapter of nlpwp.org concerns part-of-speech tagging. What’s that? Well, imagine you have a corpus and you want to add tags to its words (or ‘tokens’) explaining their function: “The” is an article, “president” is a noun, “goes” is a verb, etc. You could do it all by hand, but you could...
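As a tiny illustration of what tagged output looks like — the pair representation and the specific Brown Corpus tags are my choice of example, not taken from the post:

```haskell
-- A hand-tagged sentence as (token, tag) pairs, using Brown tags:
-- AT = article, NN = singular noun, VBZ = 3rd-person singular verb.
tagged :: [(String, String)]
tagged =
  [ ("The",       "AT")
  , ("president", "NN")
  , ("goes",      "VBZ")
  ]
```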

Full circle

It’s been a long trip, but here we are. Back at the beginning. The first, the last, my frequency. The first few weeks of this blog have basically been a laboratory and playground. Although it is not very hard to find the frequency of an n-gram in a corpus, we took some detours following the...

Where’s the Map?

In the previous posts, the frequency of n-grams in a given corpus was calculated using suffix arrays. Although this works well, I wondered if there was a more accessible way to find a string’s frequency. The Map type is used a lot, so I rewrote some previous code using a Map: import qualified Data.Map...
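A Map-based frequency count along those lines might look like this — a minimal sketch with my own names, not the post's actual code:

```haskell
import qualified Data.Map as M

-- Count how often each token occurs in a corpus.
freqMap :: [String] -> M.Map String Int
freqMap = foldr (\tok -> M.insertWith (+) tok 1) M.empty

-- Frequency of a single token, 0 if it never occurs.
frequency :: String -> M.Map String Int -> Int
frequency = M.findWithDefault 0
```

For example, `frequency "the" (freqMap (words "the cat saw the dog"))` gives 2.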

It’s the economy, stupid

Playing around with suffix arrays and binary search functions was fun, but when we used them in combination with the Brown Corpus, they turned out to be far too slow to find the frequency of a word. In the previous post, I gave two possible reasons: building a suffix array of one million words takes...

What’s the frequency, Kenneth?

The last few days have been quite hectic on this blog, with a lot of code being posted and lots of new material. Today, I’d like to start with a small recap of what has been happening here over the last few weeks. The blog started off with the Brown Corpus: open it, return some statistics,...