The last few days, I’ve been working on a quest for the most frequent n-gram in a corpus. This means that the frequency of every single n-gram in the corpus has to be found and then compared. Linear search proves far too slow for this feat because you have to iterate through the entire list every time you want to find the frequency of an element.
Binary search (using suffix arrays) requires far fewer steps. I did not run a proper comparison of linear and binary search on the Brown Corpus, but the former would probably take a few hours, while the latter comes up with a result within a few minutes (between three and four).
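
Why is linear search so slow? Every frequency query has to walk through the complete list of n-grams. Any linear lookup boils down to something like this sketch (just for illustration, not the code used below):

import qualified Data.Vector as V

-- Frequency of x found by scanning the whole Vector: a full pass
-- over the data for every single query.
linearFrequency :: Eq a => V.Vector a -> a -> Int
linearFrequency nList x = V.length (V.filter (== x) nList)

And there is one such query per n-gram.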

The code built around the binary search is quite simple: we look for the maximum of a list, not according to the value of each element, but according to its frequency:

import System.IO
import qualified Data.Vector as V

-- prepare, sFromList, sToVector and biFrequency are defined elsewhere in the program.
high :: [String] -> IO ()
high [number, fileName] = do
    fileHandle <- openFile fileName ReadMode
    contents <- hGetContents fileHandle
    -- clean up the corpus and split it into words
    let wordList = prepare contents
    -- the n in n-gram, read from the command line
    let n = read number :: Int
    -- build the Vector of n-grams from the (sorted) suffix array
    let nList = V.map (V.take n) (sToVector . sFromList $ wordList)
    -- take the n-gram with the highest frequency, not the highest value
    let highest = V.maximumBy (flCompare nList) nList
    putStrLn (unwords . V.toList $ highest)
    hClose fileHandle
high _ =
    putStrLn "Error: the high command requires two arguments: high Int filename"

-- Compare two n-grams by their frequency in the n-gram Vector.
flCompare :: (Ord a) => V.Vector a -> a -> a -> Ordering
flCompare nList x y = compare (biFrequency nList x) (biFrequency nList y)

In the code we:

  1. read the contents of the Brown Corpus and clean it up a bit
  2. read the number from the command line
  3. build a Vector of n-grams according to that number (nList)
  4. look for the maximum of the list according to our own compare function
  5. define our own compare function, flCompare, which looks up those frequencies using binary search (see the sketch below).
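
biFrequency itself is defined elsewhere, but the idea is easy to sketch. Because the suffix array keeps the n-grams sorted, all occurrences of a given n-gram sit next to each other, so two binary searches are enough to count them. Roughly (a simplified sketch, not necessarily the exact implementation):

import qualified Data.Vector as V

-- Frequency of x in a sorted Vector: the distance between the first
-- position whose element is >= x and the first position whose element
-- is > x. Both positions are found with a binary search.
biFrequencySketch :: Ord a => V.Vector a -> a -> Int
biFrequencySketch v x = search (> x) - search (>= x)
  where
    -- first index in 0 .. length v whose element satisfies p; p is
    -- monotone over the sorted vector, so the interval can be halved
    -- at every step
    search p = go 0 (V.length v)
      where
        go lo hi
          | lo >= hi      = lo
          | p (v V.! mid) = go lo mid
          | otherwise     = go (mid + 1) hi
          where mid = (lo + hi) `div` 2

Each lookup then costs two binary searches, so a logarithmic number of comparisons instead of a full pass over the vector.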

>time ./brown5 high 5 brown.txt
_” _, he said _.

real 3m52.135s
user 3m50.700s
sys 0m1.254s
>time ./brown5 high 2 brown.txt
of the

real 3m17.254s
user 3m15.956s
sys 0m1.112s

And that’s that. The rest of the time I tried to speed up the process and played around with the types (Vector, SuffixArray, list…). The next few days, I will get acquainted with the Data.Map module and try to rewrite the whole frequency story using a Map. Just for the heck of it.
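
The rough plan there is to count every n-gram once and then take the key with the highest count, along these lines (an untested sketch, just to show the direction):

import qualified Data.Map as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Count every n-gram once with a Map, then pick the most frequent one
-- (assumes a non-empty list of n-grams).
mostFrequent :: Ord a => [a] -> (a, Int)
mostFrequent xs = maximumBy (comparing snd) (M.toList counts)
  where
    counts = M.fromListWith (+) [ (x, 1 :: Int) | x <- xs ]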