Yesterday we wrote a binary search function based on the work of Harm Brouwer and DaniĆ«l de Kok at nlpwp.org. Today we will write a binary search function that finds the frequency of a substring. It’s also based on the work at nlpwp.org.

Okay… Imagine that we have a sorted array containing some characters:

0 1 2 3 4 5 6 7 8 9
a a a b b c c c c d

How do we find the frequency of ‘c’ in the array (apart from looking at it)? Well… We substract the index of its first occurrence from the index of its last occurrence and add one:

frequency = sLast - sFirst + 1
frequency = 8 - 5 + 1
frequency = 4

The first

The function to find the first occurrence resembles the binary search function from the previous post. But we don’t return the index once we find the element we are looking for. Instead, we keep on moving to the left until we find the element that only has smaller elements on its left and larger or equal ones on its right. That will be the first occurrence.
Of course, if we don’t find the element at all, the function should return Nothing. We also add a little helper function:

firstOccur :: (Ord a) => V.Vector a -> a -> Int -> Int -> Maybe Int
firstOccur array string lower upper	
				| V.null array = Nothing
				| upper == lower = case compare string (array V.! lower) of 
					EQ -> Just lower
					_ -> Nothing
				| otherwise = case compare string (array V.! middle) of
					GT -> firstOccur array string (middle + 1) upper
					_ -> firstOccur array string lower middle
				where middle = (lower + upper) `div` 2

findFirst :: (Ord a) => V.Vector a -> a -> Maybe Int
findFirst array string = firstOccur array string 0 (V.length array -1)

The last

Next, we look for the last occurrence of our element. We do just the same as in firstOccur, except that we move right until we find the one single element that has smaller or equal elements on its left and larger ones on its right:

lastOccur :: (Ord a) => V.Vector a -> a -> Int -> Int -> Maybe Int
lastOccur array string lower upper 	
				| V.null array = Nothing
				| upper == lower = case compare string (array V.! lower) of
					EQ -> Just lower
					_ -> Nothing
				| otherwise = case compare string (array V.! middle) of
					LT -> lastOccur array string lower (middle - 1)
					_ -> lastOccur array string middle upper
				where middle = ((lower + upper) `div` 2) + 1

findLast :: (Ord a) => V.Vector a -> a -> Maybe Int
findLast array string = lastOccur array string 0 (V.length array - 1)

My frequency

And then we glue it all together using the calculation we just saw:

biFrequency :: (Ord a) => V.Vector a -> a -> Maybe Int
biFrequency array string  = do
			sLast 	<- findLast array string
			sFirst	<- findFirst array string
			return (sLast - sFirst + 1)

Let's give it a try in ghci:

*Main Data.Vector> biFrequency (Data.Vector.fromList "aaabbcccd") 'a'
Just 3
*Main Data.Vector> biFrequency (Data.Vector.fromList "aaabbcccd") 'b'
Just 2
*Main Data.Vector> biFrequency (Data.Vector.fromList "aaabbcccd") 'c'
Just 3
*Main Data.Vector> biFrequency (Data.Vector.fromList "aaabbcccd") 'd'
Just 1
*Main Data.Vector> biFrequency (Data.Vector.fromList "aaabbcccd") 'z'
Nothing

Cool. Now we will add one last function that takes a list and an element, turns the list into a suffix array and finds the frequency of the element.

frequency :: (Ord a) => [a] -> [a] -> Maybe Int
frequency array string = biFrequency shortArray (V.fromList string)
		where 
			stringLength = V.length . V.fromList $ string
			shortArray = V.map (V.take stringLength) (sToVector. sFromList $ array)

Let's play:

*Main Data.Vector> frequency ["to", "be", "or", "not", "to", "be"] ["be"]
Just 2
*Main Data.Vector> frequency ["to", "be", "or", "not", "to", "be"] ["to","be"]
Just 2
*Main Data.Vector> frequency ["to", "be", "or", "not", "to", "be"] ["or"]
Just 1

Frankly, all this typing of lists and square brackets in the Terminal is getting on our nerves. Strings are basically lists, right? So let's use strings:

*Main Data.Vector> frequency "to be or not to be" "be"
Just 2
*Main Data.Vector> frequency "to be or not to be" "t"
Just 3
*Main Data.Vector> frequency "to be or not to be" " "
Just 5
*Main Data.Vector> frequency "to be or not to be" " b"
Just 2
*Main Data.Vector> frequency "to be or not to be" "Waka Jawaka!"
Nothing
*Main Data.Vector> frequency "abracadabra" "c"
Just 1
*Main Data.Vector> frequency "abracadabra" "ac"
Just 1

Is it powerful or what? In the next post, we will feed the Brown Corpus to our brand new binary search and frequency functions and see if they still work well. And after that we will have a race! All together now: "The first, the last, my frequency!"