nlpwp.org

It’s the economy, stupid

Playing around with suffix arrays and binary search functions was fun, but when we used them in combination with the Brown Corpus, they turned out to be far too slow to find the frequency of a word. In the previous post, I gave two possible reasons: building a suffix array of one million words takes...

What’s the frequency, Kenneth?

The last few days have been quite hectic on this blog, with a lot of code being posted and lots of new things. Today, I’d like to start with a small recap of what has been happening here the last few weeks. The blog started off with the Brown Corpus: open it, return some statistics,...

The first, the last, my frequency

Yesterday we wrote a binary search function based on the work of Harm Brouwer and Daniël de Kok at nlpwp.org. Today we will write a binary search function that finds the frequency of a substring. It’s also based on the work at nlpwp.org. Okay… Imagine that we have a sorted array containing some characters: 0...

Magic revisited

Linear versus binary

In the previous post we learnt how to build a suffix array. This is: a sorted list of all the n-grams of a string or corpus. Nice… But what is it good for? In my first posts, we used linear search to find (the frequency of) a specific element in a list....
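
As a rough illustration of the idea, here is a minimal sketch of a suffix array in Haskell. It is not the code from nlpwp.org: the function name `suffixArray` is an assumption, and the naive sort over whole suffixes is chosen for readability, so it will be slow on a full corpus. The result is the list of start positions, ordered by the suffix that begins at each position.

```haskell
import Data.List (sortBy, tails)
import Data.Ord (comparing)

-- Hypothetical helper: start positions of all non-empty suffixes,
-- sorted by the suffix that starts at each position.
suffixArray :: Ord a => [a] -> [Int]
suffixArray xs = map fst (sortBy (comparing snd) indexed)
  where
    -- `tails` also yields the empty suffix; `init` drops it.
    indexed = zip [0 ..] (init (tails xs))

main :: IO ()
main = print (suffixArray "banana")  -- prints [5,3,1,0,4,2]
```

Because the positions are sorted by suffix, all occurrences of a given word or n-gram end up next to each other, which is what makes binary search possible.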

Hopeful Monsters

In the previous post, I became friends with the suffix array. In this post, we will play the part of Dr. Frankenstein and build our very own suffix array in Haskell. Indeed: we build our own friends! All the code is based on the website nlpwp.org, but I changed the name of functions and will...

You are what you is

And a suffix array is a suffix array. But what is a suffix array? Good question! I’m still in the process of trying to understand this weird creature from some top secret government sorting programme, but I will try to give a comprehensive description of my findings. I probably got it wrong somewhere, so please...

Fortune-telling

The second chapter of “Natural Language Processing for the Working Programmer” (nlpwp.org) deals with bigrams, n-grams and collocations. So what are these weird things? Bigrams are pairs of words that follow each other in a sentence. Chomsky’s sentence “Colorless green ideas sleep furiously” can be split up into the following list of bigrams: [["Colorless","green"],["green","ideas"],["ideas","sleep"],["sleep","furiously"]] N-grams...
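
The bigram split described above can be sketched in a few lines of Haskell. This is a minimal illustration, not necessarily the definition used at nlpwp.org; the name `bigrams` is an assumption. It pairs every word with its successor by zipping the list with itself shifted by one position.

```haskell
-- Hypothetical helper: pair each element with the one that follows it.
bigrams :: [a] -> [[a]]
bigrams xs = zipWith (\a b -> [a, b]) xs (drop 1 xs)

main :: IO ()
main = print (bigrams (words "Colorless green ideas sleep furiously"))
-- prints [["Colorless","green"],["green","ideas"],["ideas","sleep"],["sleep","furiously"]]
```

A list with fewer than two elements simply yields no bigrams, because `zipWith` stops at the end of the shorter list.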

Broken promises

The cute little program “brown” worked, but it had some problems: When I used it on the complete Brown Corpus, it created a stack space overflow because of lazy evaluation. In human language: it crashed. It did everything at once. I want to be able to pass commands and arguments to the application. The frequency...

First post!

The big test… In order to do something with Haskell and linguistics, I figured that I had to get my fundamentals right. Just to get going, I tried to write an application to:

- open a file
- use its contents for some easy computations
- send the interesting results to the screen/Terminal.

I based my little program...