nlpwp.org

Va, vis, deviens

Instead of writing code like crazy, I’m reading documentation on Brill’s tagger. The first interesting text is Eric Brill’s dissertation A Corpus-Based Approach to Language Learning, where he describes a method for transformation-based tagging. It seems that a few of my questions are being answered in his thesis, so I will work my way through it...

When God Created the Coffee Break

I haven’t been posting a lot lately, but I have been working on the POS tagger. Trust me :-). Over the last few weeks, I focused on rewriting the tagger built by Daniël de Kok and Harm Brouwer (www.nlpwp.org, based on an article by Eric Brill). I tried to understand the functions and capture them...

Travel is Dangerous

I haven’t posted anything for a while, but I have been programming and rewriting parts of the posTagger from www.nlpwp.org: the tagger now works with three different files (training, testing, evaluation). It is trained on the frequency of tags and adds the most common tag (“NN”). The next step is the implementation of transformational rules,...

LittleByLittle

I wrote the first functions for the POS-tagger; all of them are based on the work at www.nlpwp.org. The types are the main difference here: Harm Brouwer and Daniël de Kok create a new data type, while I just use type synonyms. That way, the TokenTag type is basically a tuple containing two Strings. It...

The Plan

My imitation of the POS-tagger at www.nlpwp.org was finished today and everything seems to work. Seems. The problem is that the nlpwp tagger was written for the sake of instruction and is not supposed to be a working application. The success of frequency-based tagging is evaluated, but the performance of the transformation-based tagging remains unclear,...

Taggart

In this post, we will: present the last two functions for our tagger; wrap the functions in some IO paper; go for a test drive. Finishing touches: The first function is simply a composition of tokenMostFreqTag and tokenTagFreqs to make our life a bit easier: trainFreqTagger :: [TokenTag] -> M.Map Token Tag trainFreqTagger =...

I am here to accumulate you

In the last post, we were building a POS (part of speech) tagger based on the work at www.nlpwp.org. We were already able to distil a Map from a corpus containing its tokens, their tags and the frequencies. Next, we need a function to fold through the Map and produce the tag with the highest...

Cliffhanger

The seventh chapter in nlpwp.org concerns part of speech tagging. What’s that? Well, imagine you have a corpus and you want to add tags to its words (or ‘tokens’) explaining their function. “The” is an article, “president” is a noun, “goes” is a verb etc. You could do it all by hand, but you could...

The crux of the biscuit

Classification and tagging: In the previous post, I announced that I was going to work on classification. Unfortunately, the chapter in question on nlpwp.org is still very much under construction and only contains some introductory concepts on classification. Instead, I leaped to the chapter concerning part of speech tagging and worked my way through the...

Distractions

Belgium has been the hottest place in Europe during the last few days, and I have to admit that I was distracted by the sun. Another distraction from www.nlpwp.org was the book “The Haskell Road to Logic, Maths and Programming” that’s been sitting on my bookshelf for quite a while. Only recently did I conquer...

Full circle

It’s been a long trip, but here we are. Back at the beginning. The first, the last, my frequency. The first few weeks of this blog have basically been a laboratory and playground. Although it is not very hard to find the frequency of an n-gram in a corpus, we took some detours following the...

Where’s the Map?

In the previous posts, the frequency of n-grams in a given corpus was calculated using suffix arrays. Although this works well, I wondered if there was a more accessible way to find a string’s frequency. The Map type is being used a lot, so I rewrote some previous code using a Map: import qualified Data.Map...