The last few days have been quite hectic on this blog, with a lot of code being posted and a lot of new ground covered. Today, I’d like to start with a small recap of what has been happening here over the last few weeks.

  1. The blog started off with the Brown Corpus: open it, return some statistics, close it.
  2. The focus then shifted to the frequency of words in the Corpus and the Pointwise Mutual Information (PMI) of two words.
  3. Our little application brown4 was polished, with a few faster functions for finding and replacing strings.
  4. Next, the suffix arrays came to our attention: they are an interesting way to build sorted lists of suffixes and, more specifically, of n-grams.
  5. Binary search on suffix arrays was the next step.
  6. Finally, the binary search served as a solid basis for finding the frequency of an element in a suffix array.

The last few days, I was feeling a bit insecure about the efficiency of suffix arrays and binary search to find the frequency of a word (or an n-gram) in a large corpus. It all worked well with the strings “abracadabra” and “to be or not to be”, but what would happen if we put the Brown Corpus through the following ordeal:

  1. Open it.
  2. Create a suffix array of its contents (characters). In the case of the Brown Corpus: 6,185,606 elements in the suffix array. The longest suffix will be 6,185,606 characters long, the next one 6,185,605, and so on. That’s one huge array, Mister.
  3. Alternatively, build a suffix array of the words in the corpus. In the case of the Brown Corpus: 1,014,910 words, which says enough about the size of that array.
  4. If we want to find a bigram, we only ever compare the first two words of each suffix; the rest of every suffix is dead weight. Then why did we create the full suffix array in the first place?
  5. Finally, find the frequency with our binary frequency function. Unfortunately, the functions that find the first and the last occurrence of an element in the sorted list need quite a few more steps than a single binary search does. (A minimal sketch of this whole pipeline follows the list.)
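
To make the discussion concrete, here is a minimal sketch of that pipeline in Haskell. This is not the actual brown5 code: suffixArray, lowerBound, upperBound and frequency are names I made up for this post, the tokenization is naive, and the real program wraps its result in a Maybe. It only illustrates the idea.

-- Sketch only: not the actual brown5 code.
import Data.Array (Array, listArray, bounds, (!))
import Data.List (sort, tails)

-- Naive word-level suffix array: every suffix of the token list, sorted.
-- For the Brown Corpus this means sorting ~1,014,910 suffixes, and every
-- comparison may walk a long shared prefix before it can decide.
suffixArray :: [String] -> Array Int [String]
suffixArray ws = listArray (0, length sufs - 1) sufs
  where sufs = sort (tails ws)

-- First index whose suffix starts with the pattern, comparing only the
-- first (length pat) words of each suffix.
lowerBound :: [String] -> Array Int [String] -> Int
lowerBound pat arr = go lo (hi + 1)
  where
    (lo, hi) = bounds arr
    n        = length pat
    go l h
      | l >= h                 = l
      | take n (arr ! m) < pat = go (m + 1) h
      | otherwise              = go l m
      where m = (l + h) `div` 2

-- One past the last index whose suffix starts with the pattern.
upperBound :: [String] -> Array Int [String] -> Int
upperBound pat arr = go lo (hi + 1)
  where
    (lo, hi) = bounds arr
    n        = length pat
    go l h
      | l >= h                  = l
      | take n (arr ! m) <= pat = go (m + 1) h
      | otherwise               = go l m
      where m = (l + h) `div` 2

-- Frequency of an n-gram: the size of the block of matching suffixes,
-- found with two binary searches instead of one.
frequency :: [String] -> Array Int [String] -> Int
frequency pat arr = upperBound pat arr - lowerBound pat arr

With the toy example from before, frequency (words "to be") (suffixArray (words "to be or not to be")) gives 2, as it should.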

I did feed the Brown Corpus to our new binary functions today. And boy, is it slow. I mean slooooooooooow, because all the steps above take time. First, I timed the version that uses a suffix array of n-grams on the character level:

>time ./brown5 freq " the " brown.txt
Frequency " the ": Just 57842

real 6m44.510s
user 6m37.849s
sys 0m2.837s

If we use the linear frequency function from the previous version, brown4 (a composition of length and filter), we get our result way faster:

>time ./brown4 freq "the" brown.txt
Frequency "the": 69174

real 0m1.943s
user 0m1.832s
sys 0m0.039s
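
For comparison, the linear approach of brown4 boils down to something like this one-liner. Again a sketch under naive assumptions: plain whitespace tokenization, and linearFrequency is an illustrative name, not the real function.

-- Sketch: assumes whitespace tokenization; not the real brown4 function.
-- Linear frequency: one pass over the token list, no index to build first.
linearFrequency :: String -> String -> Int
linearFrequency w = length . filter (== w) . words

No suffix array, no sorting: just a single pass over the corpus, which is why it finishes in under two seconds.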

The suffix array of characters lost. And the suffix array of words?

>time ./brown5 freq -c "the" brown.txt
Frequency "the": Just 69174

real 0m55.756s
user 0m51.906s
sys 0m0.779s

It lost, too. Tomorrow, I will do some profiling of the application, but my guess is that:

  • It takes way too long to create the suffix array.
  • The (binary) frequency function is not as efficient as it seems.
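
A quick back-of-the-envelope calculation supports the first guess. Sorting n suffixes takes on the order of n · log₂(n) comparisons; for n = 6,185,606 that is roughly 6.2 million × 22.6 ≈ 140 million comparisons. And a comparison of two suffixes is not a single operation: suffixes of the same corpus can share a long prefix, so one comparison may scan many characters before it can decide. Add the memory pressure of keeping millions of lazy suffixes alive, and six minutes sounds less mysterious. But that is a guess, not a measurement; the profiler will have the last word.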

Am I right? We’ll see in the next post!