Code

When God Created the Coffee Break

I haven’t been posting a lot lately, but I have been working on the POS tagger. Trust me :-). Over the last few weeks, I have focused on rewriting the tagger built by Daniël de Kok and Harm Brouwer (www.nlpwp.org, based on an article by Eric Brill). I tried to understand the functions and capture them...

Travel is Dangerous

I haven’t posted anything for a while, but I have been programming and rewriting parts of the posTagger from www.nlpwp.org: the tagger now works with three different files (training, testing, evaluation); it is trained on tag frequencies and assigns the most common tag (“NN”). The next step is the implementation of transformational rules,...

LittleByLittle

I wrote the first functions for the POS tagger; all of them are based on the work at www.nlpwp.org. The types are the main difference here: Harm Brouwer and Daniël de Kok create a new data type, while I just use type synonyms. That way, the TokenTag type is basically a tuple containing two Strings. It...

Taggart

In this post, we will: present the last two functions for our tagger, wrap the functions in some IO paper, and go for a test drive. Finishing touches: the first function is simply a composition of tokenMostFreqTag and tokenTagFreqs to make our life a bit easier: trainFreqTagger :: [TokenTag] -> M.Map Token Tag trainFreqTagger =...
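The excerpt shows only the signature of trainFreqTagger. A minimal sketch of how that composition could fit together, assuming tokenTagFreqs counts tags per token and tokenMostFreqTag keeps the winner — the type synonyms and helper bodies here are my reconstruction, not the post's code:

```haskell
import qualified Data.Map as M
import Data.List (maximumBy)
import Data.Ord (comparing)

type Token = String
type Tag = String
type TokenTag = (Token, Tag)

-- Count, for every token, how often each tag occurs.
tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
tokenTagFreqs = foldr step M.empty
  where
    step (tok, tag) = M.insertWith (M.unionWith (+)) tok (M.singleton tag 1)

-- Keep only the most frequent tag for each token.
tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
tokenMostFreqTag = M.map (fst . maximumBy (comparing snd) . M.toList)

-- The composition named in the excerpt.
trainFreqTagger :: [TokenTag] -> M.Map Token Tag
trainFreqTagger = tokenMostFreqTag . tokenTagFreqs
```

On a toy corpus, trainFreqTagger [("the","AT"),("dog","NN"),("the","AT")] maps "the" to "AT" and "dog" to "NN".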

I am here to accumulate you

In the last post, we were building a POS (part of speech) tagger based on the work at www.nlpwp.org. We were already able to distil a Map from a corpus containing its tokens, their tags and the frequencies. Next, we need a function to fold through the Map and produce the tag with the highest...
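The fold described here could look roughly like this — the function name, accumulator shape, and Maybe result are my assumptions, not necessarily the post's code:

```haskell
import qualified Data.Map as M

type Tag = String

-- Fold through a tag-frequency map, accumulating the tag with the
-- highest count seen so far (Nothing for an empty map).
mostFrequentTag :: M.Map Tag Int -> Maybe Tag
mostFrequentTag m = fst <$> M.foldrWithKey step Nothing m
  where
    step tag n acc = case acc of
      Just (_, bestN) | bestN >= n -> acc
      _                            -> Just (tag, n)

-- mostFrequentTag (M.fromList [("NN", 5), ("VB", 2)]) == Just "NN"
```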

Cliffhanger

The seventh chapter in nlpwp.org concerns part of speech tagging. What’s that? Well, imagine you have a corpus and you want to add tags to its words (or ‘tokens’) explaining their function. “The” is an article, “president” is a noun, “goes” is a verb etc. You could do it all by hand, but you could...

Full circle

It’s been a long trip, but here we are. Back at the beginning. The first, the last, my frequency. The first few weeks of this blog have basically been a laboratory and playground. Although it is not very hard to find the frequency of an n-gram in a corpus, we took some detours following the...

Where’s the Map?

In the previous posts, the frequency of n-grams in a given corpus was calculated using suffix arrays. Although this works well, I wondered if there was a more accessible way to find a string’s frequency. The Map type is being used a lot, so I rewrote some previous code using a Map: import qualified Data.Map...
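The teaser cuts off at the import, but the Map-based counting it describes can be sketched in a few lines (freqMap and frequency are hypothetical names of mine):

```haskell
import qualified Data.Map as M

-- Count every element once; afterwards, a frequency lookup is a
-- single findWithDefault instead of a scan through the corpus.
freqMap :: Ord a => [a] -> M.Map a Int
freqMap = foldr (\x -> M.insertWith (+) x 1) M.empty

frequency :: Ord a => a -> M.Map a Int -> Int
frequency = M.findWithDefault 0

-- frequency "the" (freqMap (words "the cat saw the dog")) == 2
```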

It’s the economy, stupid

Playing around with suffix arrays and binary search functions was fun, but when we used them in combination with the Brown Corpus, they turned out to be far too slow to find the frequency of a word. In the previous post, I gave two possible reasons: building a suffix array of one million words takes...

The first, the last, my frequency

Yesterday we wrote a binary search function based on the work of Harm Brouwer and Daniël de Kok at nlpwp.org. Today we will write a binary search function that finds the frequency of a substring. It’s also based on the work at nlpwp.org. Okay… Imagine that we have a sorted array containing some characters: 0...
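The idea behind the post can be sketched like this: in a sorted array, an item's frequency is the distance between its first and last matching positions, each found with a binary search. The names and the Data.Array representation are my assumptions, and this version compares whole elements; the substring variant from nlpwp.org compares prefixes instead:

```haskell
import Data.Array

-- First index at which x could be inserted while keeping order.
lowerBound :: Ord a => Array Int a -> a -> Int
lowerBound arr x = go (bounds arr)
  where
    go (lo, hi)
      | lo > hi       = lo
      | arr ! mid < x = go (mid + 1, hi)
      | otherwise     = go (lo, mid - 1)
      where mid = (lo + hi) `div` 2

-- First index strictly past the last occurrence of x.
upperBound :: Ord a => Array Int a -> a -> Int
upperBound arr x = go (bounds arr)
  where
    go (lo, hi)
      | lo > hi        = lo
      | arr ! mid <= x = go (mid + 1, hi)
      | otherwise      = go (lo, mid - 1)
      where mid = (lo + hi) `div` 2

-- Frequency is the width of the matching range.
freqOf :: Ord a => Array Int a -> a -> Int
freqOf arr x = upperBound arr x - lowerBound arr x

-- freqOf (listArray (0, 4) "aabbc") 'b' == 2
```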

Hopeful Monsters

In the previous post, I became friends with the suffix array. In this post, we will play the part of Dr. Frankenstein and build our very own suffix array in Haskell. Indeed: we build our own friends! All the code is based on the website nlpwp.org, but I changed the name of functions and will...
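Conceptually, the monster being built here is small: a suffix array is the list of all suffix start positions, sorted by the suffix each one points to. This naive one-liner is my own simplification, not the nlpwp.org code:

```haskell
import Data.List (sortOn)

-- Naive suffix array: indices of all suffixes, sorted by suffix.
-- Fine for illustration; far too slow for a million-word corpus.
suffixArray :: Ord a => [a] -> [Int]
suffixArray xs = sortOn (\i -> drop i xs) [0 .. length xs - 1]

-- suffixArray "banana" == [5, 3, 1, 0, 4, 2]
```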

Replace replace

The replace function from my earlier post appears to work well and it can be used in function composition. The only downside is its speed. If you want to replace seven characters, you have to make seven runs through the list and this takes time with the Brown Corpus – 12 seconds. So I looked...
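The single-pass alternative hinted at here can be sketched as follows — instead of one traversal per substitution, every character is looked up in a Map during one traversal (replaceAll and the table argument are my names, not the post's):

```haskell
import qualified Data.Map as M

-- Replace each character via one Map lookup, in a single pass
-- over the list; characters not in the table pass through.
replaceAll :: M.Map Char Char -> String -> String
replaceAll table = map (\c -> M.findWithDefault c c table)

-- replaceAll (M.fromList [(',', ' '), ('.', ' ')]) "a,b." == "a b "
```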