Playing around with suffix arrays and binary search functions was fun, but when we used them in combination with the Brown Corpus, they turned out to be far too slow to find the frequency of a word. In the previous post, I gave two possible reasons:

  • building a suffix array of one million words takes time
  • binary search is slower than we think when used to find the frequency of a word

So I did some profiling and put the results in a pdf. saCompare is a real memory hog, so building a suffix array is not always a good idea. The binary frequency function, on the other hand, performs quite well, so no worries there.
The only way around the cost of building suffix arrays is saving them somewhere and then doing a binary search on them, although the files will easily turn out to be a few gigabytes in size. And every time something changes, you have to build a new suffix array and save it again.
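If you did want to persist them, a minimal sketch with the binary package could look like this, assuming the suffix array is kept as a plain list of suffix start indices (saveSA and loadSA are made-up names, not functions from this project):

import Data.Binary (encodeFile, decodeFile)

-- Assumption: the suffix array is the list of suffix start indices.
saveSA :: FilePath -> [Int] -> IO ()
saveSA = encodeFile

loadSA :: FilePath -> IO [Int]   -- throws if the file is corrupt
loadSA = decodeFile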

Instead, I went back to the linear frequency function (basically length . filter (==searchString) $ corpus) and built a few functions around it. The biGram function became an nGrams function, which allows us to build a list of n-grams:

nGrams :: Int -> [String] -> [[String]]
nGrams _ [] = []
nGrams n xs
  | length xs < n = []  -- stop before the grams at the end get too short
  | otherwise     = take n xs : nGrams n (tail xs)
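
Counting a phrase is then a matter of splitting it into words, building n-grams of the same length, and counting exact matches. A minimal sketch on top of nGrams (phraseFreq is a made-up name, not the actual function in brown6):

phraseFreq :: String -> [String] -> Int
phraseFreq phrase corpus = length . filter (== ws) $ nGrams (length ws) corpus
  where ws = words phrase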

When I want to search for “vice president”, I create a list of bigrams and then search it. And if I’m looking for “the vice president”, I create a list of trigrams, and so on. In the end, this is faster than building a suffix array:

>time ./brown6 freq "a veteran jackson county legislator will ask the georgia house" brown.txt
Frequency "a": 22851
Frequency "veteran": 27
Frequency "jackson": 34
Frequency "county": 153
Frequency "legislator": 7
Frequency "will": 2221
Frequency "ask": 121
Frequency "the": 69174
Frequency "georgia": 43
Frequency "house": 575
Frequency "a veteran jackson county legislator will ask the georgia house": 1
real 0m4.512s
user 0m4.317s
sys 0m0.193s

Of course, suffix arrays could be useful for other things, such as finding the most frequent n-gram in the corpus. That’s what I’ll be trying next.
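
In the meantime, here is a rough sketch of what that count could look like without a suffix array, simply tallying n-grams in a Map (mostFrequentNGram is a made-up name, and it reuses the nGrams function from above):

import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Tally every n-gram in a Map and pick the most frequent one.
-- Assumes the corpus has at least n words; maximumBy fails on an empty list.
mostFrequentNGram :: Int -> [String] -> ([String], Int)
mostFrequentNGram n corpus =
  maximumBy (comparing snd) . Map.toList $
    Map.fromListWith (+) [ (g, 1) | g <- nGrams n corpus ]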