In the previous post, I became friends with the suffix array. In this post, we will play the part of Dr. Frankenstein and build our very own suffix array in Haskell. Indeed: we build our own friends! All the code is based on the website nlpwp.org, but I changed the name of functions and will explain the code step by step in my own words.

Okay. If we want to build our very own suffix array, we first need a way to:

  1. compare the elements of a string (or a list)
  2. sort the elements based on the comparison
  3. get the suffix array
  4. store both the original string and the suffix array.

Enter the compare function, which resides in the standard library of Haskell. I played around with it for a bit and it utters three things: LT (less than), EQ (equal), GT (greater than). This sounds obvious for numbers, but we can also feed it characters, lists of characters (i.e. strings) and lists of strings. Let’s take compare for a spin in ghci:

Prelude> compare 1 9
LT
Prelude> compare ‘a’ ‘z’
LT
Prelude> compare ‘a’ ‘a’
EQ
Prelude> compare ‘z’ ‘a’
GT
Prelude> compare “waka” “jawaka”
GT
Prelude> compare “to be” “not to be”
GT
Prelude> compare [“to”,”be”,”or”,”not”,”to”,”be”] [“to”, “be”]
GT

The next thing we need is a function that does the actual sorting. I wonder what its name is? Well… Not sort but sortBy, from the Data.List module:

sortBy :: (a -> a -> Ordering) -> [a] -> [a]

sortBy takes a comparison function (compare in our case) and a list of things to be sorted according to that function. Let’s take it for a ride:

*Main> :m Data.List
Prelude Data.List> sortBy compare [3,2,1]
[1,2,3]
Prelude Data.List> sortBy compare “to be or not to be”
”     bbeenoooorttt”
Prelude Data.List> sortBy compare [“to be or not to be”, “be or not to be”, “or not to be”, “not to be”, “to be”, “be”]
[“be”,”be or not to be”,”not to be”,”or not to be”,”to be”,”to be or not to be”]

The third limb of our monster of Frankenstein is the vessel to store both the original string and the suffix array. We will use the data type SuffixArray that contains two lists, one for the string (“corpus”) and one for the array. Not the regular off-the-shelf kind of lists, no… We will use the Vector data type, comparable to arrays in other languages because they support random access. Data.Vector is not in the standard library, you’ll have to download it using cabal.

import qualified Data.Vector as V
import qualified Data.List as L
data SuffixArray a = SuffixArray (V.Vector a) (V.Vector Int)
	deriving Show

Next, we write a function that compares two n-grams:

saCompare :: Ord a => V.Vector a -> Int -> Int -> Ordering
saCompare corpus index1 index2 = compare (V.drop index1 corpus) (V.drop index2 corpus)

What does it do? Well, it takes a corpus and compares two n-grams of that corpus. But it’s easier if you just try it out:

Prelude> :m Data.Vector
Prelude Data.Vector> :l buildArray.hs
[1 of 1] Compiling Main ( buildArray.hs, interpreted )
Ok, modules loaded: Main.
*Main Data.Vector> let corpus = Data.Vector.fromList [“to”,”be”,”or”,”not”,”to”,”be”]
Loading package primitive-0.3.1 … linking … done.
Loading package vector-0.7.0.1 … linking … done.
*Main Data.Vector> saCompare corpus 5 5
EQ

Why is this equal? We’re comparing the same thing, namely the last element of the list: ["be"]

*Main Data.Vector> saCompare corpus 4 5
GT

Now we’re comparing ["to","be"] to ["be"] and the former is – lexicographically – greater than the latter. What would happen if we used saCompare in our sorting function to sort a list of Ints? Let’s find out!

*Main Data.Vector> Data.List.sortBy (saCompare corpus) [0..5]
[5,1,3,2,4,0]
*Main Data.Vector> Data.List.sortBy (saCompare $ Data.Vector.fromList “abracadabra”) [0..11]
[11,10,7,0,3,5,8,1,4,6,9,2]

That’s the suffix array! Cool! Now we just need to wrap it in a function that

  • finds the upper boundary of the list of Ints (i.e. the length of the corpus minus one)
  • stores the results in our SuffixArray data type.
buildArray :: Ord a => V.Vector a -> SuffixArray a
buildArray corpus = SuffixArray (corpus) (V.fromList index)
		where
			upperBound = (V.length corpus - 1)
			index = L.sortBy (saCompare corpus) [0..upperBound]

Let’s give it a spin:

*Main Data.Vector> buildArray corpus
SuffixArray fromList [“to”,”be”,”or”,”not”,”to”,”be”] :: Data.Vector.Vector fromList [5,1,3,2,4,0] :: Data.Vector.Vector
*Main Data.Vector> buildArray (Data.Vector.fromList “abracadabra”)
SuffixArray fromList “abracadabra” :: Data.Vector.Vector fromList [10,7,0,3,5,8,1,4,6,9,2] :: Data.Vector.Vector

And with this, ladies and gentlemen, we built our very own monster: a suffix array in Haskell. Just keep it off the ice, will ya? But there should be more to our little monster. Just a few more limbs, maybe a mouth and some hair:

sFromList :: Ord a => [a] -> SuffixArray a
sFromList = buildArray . V.fromList

sToList :: SuffixArray a -> [[a]]
sToList (SuffixArray corpus index) = V.foldr buildSuffixes [] index
			where buildSuffixes pos list = V.toList (V.drop pos corpus) : list

sFromList will simply take a list, turn it into a Vector and feed it to buildArray. sToList on the other hand, will accept a SuffixArray and then spawn an ordered list of the possible suffixes. These two combined are pure magic:

*Main> sToList . sFromList $ [“to”, “be”, “or”, “not”, “to”, “be”]
[[“be”],[“be”,”or”,”not”,”to”,”be”],[“not”,”to”,”be”],[“or”,”not”,”to”,”be”],[“to”,”be”],
[“to”,”be”,”or”,”not”,”to”,”be”]]
*Main> sToList . sFromList $ “abracadabra”
[“a”,”abra”,”abracadabra”,”acadabra”,”adabra”,”bra”,”bracadabra”,”cadabra”,”dabra”,”ra”,”racadabra”]