nlp

I am here to accumulate you

In the last post, we were building a POS (part of speech) tagger based on the work at www.nlpwp.org. We were already able to distil from a corpus a Map containing its tokens, their tags, and their frequencies. Next, we need a function to fold through the Map and produce the tag with the highest...
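For the curious, here is a minimal sketch of what such a fold could look like. It assumes the per-token map has the shape Map String Int (tag to count); the names TagFreqs and mostFrequentTag are mine, not nlpwp.org's.

```haskell
import qualified Data.Map as M

-- Hypothetical shape: for a single token, a map from tag to the
-- number of times that tag was observed in the corpus.
type TagFreqs = M.Map String Int

-- Fold through the Map, keeping the (tag, count) pair with the
-- highest count; Nothing for an empty map.
mostFrequentTag :: TagFreqs -> Maybe String
mostFrequentTag m = fst <$> M.foldrWithKey pick Nothing m
  where
    pick tag n acc = case acc of
      Just (_, best) | best >= n -> acc
      _                          -> Just (tag, n)
```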

Cliffhanger

The seventh chapter in nlpwp.org concerns part of speech tagging. What’s that? Well, imagine you have a corpus and you want to add tags to its words (or ‘tokens’) explaining their function. “The” is an article, “president” is a noun, “goes” is a verb etc. You could do it all by hand, but you could...
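One simple way to represent such a tagged sentence in Haskell is as a list of token/tag pairs. The tags below follow Brown Corpus conventions (AT = article, NN = singular noun, VBZ = 3rd-person singular verb); the type names are mine.

```haskell
type Token = String
type Tag   = String

-- A tagged sentence: each token paired with its part-of-speech tag.
taggedSentence :: [(Token, Tag)]
taggedSentence = [("The", "AT"), ("president", "NN"), ("goes", "VBZ")]
```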

The crux of the biscuit

Classification and tagging. In the previous post, I announced that I was going to work on classification. Unfortunately, the chapter in question on nlpwp.org is still very much under construction and only contains some introductory concepts on classification. Instead, I leaped to the chapter concerning part of speech tagging and worked my way through the...

Full circle

It’s been a long trip, but here we are. Back at the beginning. The first, the last, my frequency. The first few weeks of this blog have basically been a laboratory and playground. Although it is not very hard to find the frequency of an n-gram in a corpus, we took some detours following the...

Where’s the Map?

In the previous posts, the frequency of n-grams in a given corpus was calculated using suffix arrays. Although this works well, I wondered if there was a more accessible way to find a string's frequency. The Map type gets used a lot, so I rewrote some of the previous code using a Map: import qualified Data.Map...
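A minimal sketch of the Map-based approach (my own rendering, not the post's code): build the frequency map once with fromListWith, then look tokens up in it.

```haskell
import qualified Data.Map as M

-- Build a frequency map from a list of tokens. fromListWith (+)
-- collapses duplicate keys by summing their counts.
frequencies :: Ord a => [a] -> M.Map a Int
frequencies tokens = M.fromListWith (+) [ (t, 1) | t <- tokens ]

-- Look up the frequency of one token; 0 if it never occurs.
frequencyOf :: Ord a => a -> M.Map a Int -> Int
frequencyOf = M.findWithDefault 0
```

Something like frequencyOf "the" (frequencies (words corpus)) then answers the original question in two passes: one to build the map, one logarithmic lookup per query.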

… and we’re back!

I haven’t been taking any trains the last few days, which means that I haven’t been working on my train project (this blog). But now I’m using the fiendish Belgian railway service again, and from tomorrow on you will be able to read frequent posts once more (no pun intended). The first one will concern...

For the heck of it

The last few days, I’ve been working on a quest for the most frequent n-gram in a corpus. This means that the frequency of every single n-gram in the corpus has to be found and then compared. Linear search proves far too slow for this feat because you have to iterate through the entire list...
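A sketch of the Map-based alternative to that linear search (my own minimal version; the names ngrams and mostFrequentNgram are hypothetical, not from the post): count every n-gram exactly once, then scan the counts for the maximum.

```haskell
import qualified Data.Map as M
import Data.List (tails, maximumBy)
import Data.Ord (comparing)

-- All n-grams of a list of tokens.
ngrams :: Int -> [a] -> [[a]]
ngrams n = filter ((== n) . length) . map (take n) . tails

-- The most frequent n-gram and its count: build the counts in a
-- single pass over the corpus, then take the maximum, instead of
-- running a linear search per candidate n-gram.
mostFrequentNgram :: Ord a => Int -> [a] -> Maybe ([a], Int)
mostFrequentNgram n xs = case M.toList counts of
  []  -> Nothing
  kvs -> Just (maximumBy (comparing snd) kvs)
  where
    counts = M.fromListWith (+) [ (g, 1 :: Int) | g <- ngrams n xs ]
```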

It’s the economy, stupid

Playing around with suffix arrays and binary search functions was fun, but when we used them in combination with the Brown Corpus, they turned out to be far too slow to find the frequency of a word. In the previous post, I gave two possible reasons: building a suffix array of one million words takes...

What’s the frequency, Kenneth?

The last few days have been quite hectic on this blog, with a lot of code being posted and lots of new things. Today, I’d like to start with a small recap of what has been happening here the last few weeks. The blog started off with the Brown Corpus: open it, return some statistics,...

The first, the last, my frequency

Yesterday we wrote a binary search function based on the work of Harm Brouwer and Daniël de Kok at nlpwp.org. Today we will write a binary search function that finds the frequency of a substring, again based on the work at nlpwp.org. Okay… Imagine that we have a sorted array containing some characters: 0...
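The core idea, sketched in my own words (a sorted Data.Vector stands in for the post's array, and the function names are mine): two binary searches find the first and last positions of an element, and the frequency is the width of that range.

```haskell
import qualified Data.Vector as V

-- Index of the first element >= x in a sorted vector (lower bound).
lowerBound :: Ord a => a -> V.Vector a -> Int
lowerBound x v = go 0 (V.length v)
  where
    go lo hi
      | lo >= hi  = lo
      | otherwise =
          let mid = (lo + hi) `div` 2
          in if v V.! mid < x then go (mid + 1) hi else go lo mid

-- Index of the first element > x (upper bound).
upperBound :: Ord a => a -> V.Vector a -> Int
upperBound x v = go 0 (V.length v)
  where
    go lo hi
      | lo >= hi  = lo
      | otherwise =
          let mid = (lo + hi) `div` 2
          in if v V.! mid <= x then go (mid + 1) hi else go lo mid

-- Frequency = number of positions between the two bounds.
frequency :: Ord a => a -> V.Vector a -> Int
frequency x v = upperBound x v - lowerBound x v
```

With the characters of "hello" sorted into a vector, frequency 'l' gives 2, using two O(log n) searches instead of a full scan.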

Magic revisited

Linear versus binary. In the previous post we learnt how to build a suffix array, that is: a sorted list of all the suffixes of a string or corpus. Nice… But what is it good for? In my first posts, we used linear search to find (the frequency of) a specific element in a list....
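A rough sketch of the contrast, assuming the sorted suffixes live in a Data.Vector (the names occursLinear and occursBinary are mine, not the site's): the linear version probes every suffix, the binary version only O(log n) of them.

```haskell
import Data.List (isPrefixOf)
import qualified Data.Vector as V

-- Linear search: test the query against every suffix, O(n) probes.
occursLinear :: String -> V.Vector String -> Bool
occursLinear q = V.any (q `isPrefixOf`)

-- Binary search over lexicographically sorted suffixes, O(log n)
-- probes: a query occurs in the corpus iff it is a prefix of some
-- suffix, and all such suffixes sit in one contiguous sorted range.
occursBinary :: String -> V.Vector String -> Bool
occursBinary q v = go 0 (V.length v - 1)
  where
    go lo hi
      | lo > hi          = False
      | q `isPrefixOf` s = True
      | s < q            = go (mid + 1) hi
      | otherwise        = go lo (mid - 1)
      where
        mid = (lo + hi) `div` 2
        s   = v V.! mid
```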

Hopeful Monsters

In the previous post, I became friends with the suffix array. In this post, we will play the part of Dr. Frankenstein and build our very own suffix array in Haskell. Indeed: we build our own friends! All the code is based on the website nlpwp.org, but I changed the names of some functions and will...
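As a teaser, the naive construction fits in one line. This is my rendering under different names than the post uses: take every suffix with tails, drop the empty one, and sort.

```haskell
import Data.List (sort, tails)
import qualified Data.Vector as V

-- A naive suffix array: every suffix of the corpus, sorted.
-- tails also yields the empty suffix, which init drops. Because
-- Haskell lists share structure, keeping all suffixes is cheaper
-- than it looks.
buildSuffixArray :: Ord a => [a] -> V.Vector [a]
buildSuffixArray = V.fromList . sort . init . tails
```

For example, buildSuffixArray "banana" yields the suffixes in the order "a", "ana", "anana", "banana", "na", "nana", ready for the binary searches from the earlier posts.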