Blog

Travel is Dangerous

I haven’t posted anything for a while, but I have been programming and rewriting parts of the posTagger from www.nlpwp.org: the tagger now works with three different files (training, testing, evaluation); it is trained on the frequency of tags and adds the most common tag (“NN”). The next step is the implementation of transformational rules,...

The weight

A short post again. I wrote functions that add the default “NN” tag to the model if the token does not contain a tag. More precisely: when the tag value is Nothing instead of Just a tag. I also copied some functions that should run the model against an untagged test file. The distinction between...
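The defaulting step described above can be sketched in a few lines. This is a minimal illustration, assuming the model stores tags as Maybe values; `defaultTag` is an illustrative name, not necessarily the post’s own function:

```haskell
import Data.Maybe (fromMaybe)

-- Fall back to the default "NN" tag when a token carries no tag
-- (Nothing); keep the existing tag otherwise (Just a tag).
-- `defaultTag` is an invented name for illustration.
defaultTag :: Maybe String -> String
defaultTag = fromMaybe "NN"
```

For example, `defaultTag Nothing` yields `"NN"`, while `defaultTag (Just "VB")` leaves an existing tag untouched.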

LittleByLittle

I wrote the first functions for the POS-tagger, all of them based on the work at www.nlpwp.org. The types are the main difference here: Harm Brouwer and Daniël de Kok create a new data type, while I just use type synonyms. That way, the TokenTag type is basically a tuple containing two Strings. It...
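The synonym approach the excerpt describes might look like this sketch (the names follow the excerpt; the exact declarations in the post may differ):

```haskell
-- Type synonyms instead of a new data type: TokenTag is
-- just a pair of Strings.
type Token = String
type Tag = String
type TokenTag = (Token, Tag)

-- An example value: an ordinary tuple works directly.
example :: TokenTag
example = ("president", "NN")
```

The trade-off is convenience versus safety: a synonym needs no constructors or pattern matching, but the compiler will not stop you from mixing up a Token and a Tag, since both are plain Strings.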

The Plan

My imitation of the POS-tagger at www.nlpwp.org was finished today and everything seems to work. Seems. The problem is that the nlpwp tagger was written for the sake of instruction and is not supposed to be a working application. The success of frequency-based tagging is evaluated, but the performance of the transformation-based tagging remains unclear,...

The Dragon Reborn

It’s been a while since I wrote a post for this weblog, and there are some (good) reasons: I’ve been very busy, both at work and at home. I relapsed into an old habit: reading The Wheel of Time. Parents, beware! Do not let your children read this stuff! I took a stroll into a...

West Indian Revelation

In this post we will: test the efficiency of the POS tagger and improve it. Power by numbers: in order to develop the tagger, the people at www.nlpwp.org cut the Brown corpus into two parts, one for training the tagger (brown-pos-train.txt) and one for testing it (brown-pos-test.txt). How do we test the tagger? We...
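One common way to test a tagger, sketched here under the assumption that we score the fraction of tokens whose predicted tag matches the gold tag of the held-out file (`accuracy` is an illustrative name, not the post’s code):

```haskell
-- Fraction of positions where the predicted tag matches the
-- gold tag. Both lists are assumed non-empty and equally long.
-- `accuracy` is an invented name for illustration.
accuracy :: Eq tag => [tag] -> [tag] -> Double
accuracy gold predicted =
  let correct = length (filter (uncurry (==)) (zip gold predicted))
  in fromIntegral correct / fromIntegral (length gold)
```

For instance, if the gold tags are `["NN","VB","AT"]` and the tagger predicts `["NN","VB","NN"]`, two of three tokens match.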

Taggart

In this post, we will: present the last two functions for our tagger, wrap the functions in some IO paper, and go for a test drive. Finishing touches: the first function is simply a composition of tokenMostFreqTag and tokenTagFreqs to make our life a bit easier: trainFreqTagger :: [TokenTag] -> M.Map Token Tag trainFreqTagger =...
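The excerpt gives the signature but truncates the definition. Assuming tokenTagFreqs counts how often each tag occurs per token and tokenMostFreqTag keeps the winner, the composition could look like this sketch (the two helper bodies are my own reconstruction for illustration, not the post’s actual code):

```haskell
import qualified Data.Map as M
import Data.List (maximumBy)
import Data.Ord (comparing)

type Token = String
type Tag = String
type TokenTag = (Token, Tag)

-- Count how often each tag occurs for each token
-- (a sketch of tokenTagFreqs).
tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
tokenTagFreqs = foldr step M.empty
  where
    step (tok, tag) = M.insertWith (M.unionWith (+)) tok (M.singleton tag 1)

-- For every token, keep only its most frequent tag
-- (a sketch of tokenMostFreqTag).
tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
tokenMostFreqTag = M.map (fst . maximumBy (comparing snd) . M.toList)

-- The composition named in the excerpt.
trainFreqTagger :: [TokenTag] -> M.Map Token Tag
trainFreqTagger = tokenMostFreqTag . tokenTagFreqs
```

Training on `[("the","AT"),("the","AT"),("run","VB"),("run","NN"),("run","VB")]` would map "the" to "AT" and "run" to "VB", its most frequent tag.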

I am here to accumulate you

In the last post, we were building a POS (part of speech) tagger based on the work at www.nlpwp.org. We were already able to distil a Map from a corpus containing its tokens, their tags and the frequencies. Next, we need a function to fold through the Map and produce the tag with the highest...
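The fold the excerpt alludes to might be sketched as follows, assuming the inner Map goes from tags to frequency counts (`mostFreqTag` is an illustrative name; the post’s own function may differ):

```haskell
import qualified Data.Map as M

type Tag = String

-- Fold through a tag-frequency Map, keeping the tag with the
-- highest count seen so far; Nothing for an empty Map.
-- `mostFreqTag` is an invented name for illustration.
mostFreqTag :: M.Map Tag Int -> Maybe Tag
mostFreqTag = fst . M.foldlWithKey pick (Nothing, 0)
  where
    pick acc@(_, best) tag n
      | n > best  = (Just tag, n)
      | otherwise = acc
```

Carrying the running best count alongside the tag means the Map is traversed only once.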

Cliffhanger

The seventh chapter in nlpwp.org concerns part of speech tagging. What’s that? Well, imagine you have a corpus and you want to add tags to its words (or ‘tokens’) explaining their function. “The” is an article, “president” is a noun, “goes” is a verb etc. You could do it all by hand, but you could...
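As a toy illustration of the idea: tag each token by looking it up in a hand-made model, falling back to "NN" for words the model has never seen. The model and all names here are invented for illustration, not the chapter’s code:

```haskell
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

-- Look each token up in the model; unknown words get the
-- fallback tag "NN". All names are invented for illustration.
tagSentence :: M.Map String String -> [String] -> [(String, String)]
tagSentence model = map (\w -> (w, fromMaybe "NN" (M.lookup w model)))
```

With a tiny model such as `M.fromList [("The","AT"),("goes","VBZ")]`, the sentence `["The","president","goes"]` comes back with "president" tagged by the fallback "NN".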

The crux of the biscuit

Classification and tagging In the previous post, I announced that I was going to work on classification. Unfortunately, the chapter in question on nlpwp.org is still very much under construction and only contains some introductory concepts on classification. Instead, I leaped to the chapter concerning part of speech tagging and worked my way through the...

Distractions

Belgium has been the hottest place in Europe during the last few days, and I have to admit that I was distracted by the sun. Another distraction from www.nlpwp.org was the book “The Haskell Road to Logic, Maths and Programming” that’s been sitting on my bookshelf for quite a while. Only recently did I conquer...

Full circle

It’s been a long trip, but here we are. Back at the beginning. The first, the last, my frequency. The first few weeks of this blog have basically been a laboratory and playground. Although it is not very hard to find the frequency of an n-gram in a corpus, we took some detours following the...