In the last post, we were building a POS (part of speech) tagger based on the work at www.nlpwp.org. We were already able to distil a Map from a corpus containing its tokens, their tags and the frequencies. Next, we need a function to fold through the Map and produce the tag with the highest frequency for each token.

tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
tokenMostFreqTag = M.map (fst . M.foldlWithKey findMax ("NIL", 0))
  where
    findMax acc@(_, maxFreq) tag freq
      | freq > maxFreq = (tag, freq)
      | otherwise      = acc

M.foldlWithKey is a left fold over a Map that hands both the key and the value of each entry to the folding function. We apply findMax to each entry of a token's inner Map (a tag and its frequency), starting with an initial accumulator of ("NIL", 0).
The acc@(_, maxFreq) code puzzled me a bit, but Real World Haskell and Learn You a Haskell provide a good explanation of as-patterns. We basically cut up the accumulator acc into three parts, provided the as-pattern matches (“the accumulator is a tuple”):

  1. acc: the whole tuple, e.g. ("red", 0).
  2. _: the first part of the tuple, e.g. "red". Since this is of no importance to the function, we use the wildcard.
  3. maxFreq: the latter part of the tuple, e.g. 0.
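
To see the three names side by side, here is a minimal sketch of an as-pattern on its own. The function describe is a hypothetical helper that exists only for this illustration; it is not part of the tagger.

```haskell
-- Hypothetical helper, only here to illustrate the as-pattern.
-- acc     : the whole tuple
-- _       : the first component, ignored
-- maxFreq : the second component
describe :: (String, Int) -> String
describe acc@(_, maxFreq) =
  "whole: " ++ show acc ++ ", frequency: " ++ show maxFreq

main :: IO ()
main = putStrLn (describe ("red", 0))
-- prints: whole: ("red",0), frequency: 0
```

Without the as-pattern we would have to rebuild the tuple ourselves in the otherwise branch; with it, we can simply hand back acc.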

At each step, findMax compares the frequency stored in the accumulator with the frequency of the current tag. If the current tag’s frequency is greater, the pair (tag, freq) becomes the new accumulator. If not, the accumulator is passed on unchanged.
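
To watch the accumulator change, the sketch below replays the fold for a single token. It assumes Tag is a synonym for String, lifts findMax to the top level, and adds a hypothetical helper traceFindMax that uses scanl over the ascending key/value list, which is the same order in which M.foldlWithKey visits the entries.

```haskell
import qualified Data.Map as M

-- Assumption: Tag is a plain String.
type Tag = String

-- Same step function as findMax in the post, lifted to the top level.
findMax :: (Tag, Int) -> Tag -> Int -> (Tag, Int)
findMax acc@(_, maxFreq) tag freq
  | freq > maxFreq = (tag, freq)
  | otherwise      = acc

-- Hypothetical helper: every intermediate accumulator, in ascending key order.
traceFindMax :: M.Map Tag Int -> [(Tag, Int)]
traceFindMax = scanl step ("NIL", 0) . M.toAscList
  where
    step acc (tag, freq) = findMax acc tag freq

main :: IO ()
main = do
  let zebra = M.fromList [("red", 3), ("green", 2)] :: M.Map Tag Int
  print (traceFindMax zebra)
  -- prints: [("NIL",0),("green",2),("red",3)]
  print (M.foldlWithKey findMax ("NIL", 0) zebra)
  -- prints: ("red",3)
```

The last element of the trace is exactly what the fold returns.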
Folding findMax over a token’s inner Map yields a tuple with the token’s most frequent tag and that tag’s frequency. We use fst to isolate the first of the pair, and M.map applies this to every token; the result is a Map of tokens and their most frequent tag. Mission accomplished:

>tokenMostFreqTag $ tokenTagFreqs $ map toTokenTag $ words "zebra/red elephant/blue lion/green zebra/red zebra/green zebra/green zebra/red lion/blue lion/red lion/green"
fromList [("elephant","blue"),("lion","green"),("zebra","red")]
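
For anyone who wants to try this without the functions from the previous post, here is a self-contained sketch. It assumes Token and Tag are synonyms for String, and it replaces the output of tokenTagFreqs with a hand-built Map (sampleFreqs) holding the same counts as the example above. The value tieExample is made up to show what happens when two tags are equally frequent.

```haskell
import qualified Data.Map as M

-- Assumption: both are plain Strings.
type Token = String
type Tag   = String

tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
tokenMostFreqTag = M.map (fst . M.foldlWithKey findMax ("NIL", 0))
  where
    findMax acc@(_, maxFreq) tag freq
      | freq > maxFreq = (tag, freq)
      | otherwise      = acc

-- Hand-built stand-in for the result of tokenTagFreqs on the example sentence.
sampleFreqs :: M.Map Token (M.Map Tag Int)
sampleFreqs = M.fromList
  [ ("zebra",    M.fromList [("red", 3), ("green", 2)])
  , ("lion",     M.fromList [("green", 2), ("blue", 1), ("red", 1)])
  , ("elephant", M.fromList [("blue", 1)])
  ]

-- Made-up data: two tags with the same frequency.
tieExample :: M.Map Token (M.Map Tag Int)
tieExample = M.fromList [("tiger", M.fromList [("red", 1), ("blue", 1)])]

main :: IO ()
main = do
  print (tokenMostFreqTag sampleFreqs)
  -- prints: fromList [("elephant","blue"),("lion","green"),("zebra","red")]
  print (tokenMostFreqTag tieExample)
  -- prints: fromList [("tiger","blue")]
```

Because findMax only replaces the accumulator on a strictly greater frequency, and the fold visits tags in ascending key order, a tie goes to the tag that sorts first. A token with an empty inner Map would get the tag "NIL".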

Our supermodel is finished now. In the next post, we will evaluate how well she predicts the tags of unknown tokens. If we don’t get accumulated by Arnold Schwarzenegger first (yikes!).