I wrote the first functions for the POS tagger. All of them are based on the work at www.nlpwp.org. The types are the main difference: Harm Brouwer and Daniël de Kok create a new data type, while I just use type synonyms, so the TokenTag type is simply a tuple of two Strings. That seemed a bit more intuitive, but we'll see. I also removed the rsplit function because it ended up doing exactly the same as the toTokenTag function.
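To make the difference concrete, here is a rough sketch of the two styles side by side. The wrapper types (`TokenN`, `TagN`, `TokenTagN`) and the two helper functions are names I made up for illustration; they are not copied from nlpwp.org and only approximate the "new data type" approach.

```haskell
-- Type synonyms (the approach in this post): new names for existing types.
type Token = String
type Tag = String
type TokenTag = (Token, Tag)

-- Hypothetical wrapper types, roughly in the spirit of a dedicated data type.
newtype TokenN = TokenN String deriving (Show, Eq, Ord)
newtype TagN = TagN String deriving (Show, Eq, Ord)
data TokenTagN = TokenTagN TokenN TagN deriving (Show, Eq)

synonymPair :: TokenTag
synonymPair = ("the", "DT")

wrappedPair :: TokenTagN
wrappedPair = TokenTagN (TokenN "the") (TagN "DT")

-- With synonyms, ordinary tuple and String functions work directly.
tokenLength :: TokenTag -> Int
tokenLength = length . fst

-- With wrappers, the constructors have to be pattern matched away first.
tokenLengthN :: TokenTagN -> Int
tokenLengthN (TokenTagN (TokenN t) _) = length t
```

The trade-off is that with synonyms the compiler cannot tell a Token from a Tag (both are plain Strings), so accidentally swapping them still type checks. The wrapper version would reject that, at the cost of some extra unwrapping.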

You'll find the code below. It seems to work in ghci, although I only ran a short test.

    import qualified Data.Map as M
    import qualified Data.List as L
    import Data.List.Zipper as Z
    import Data.Maybe as DM
    import System.Environment (getArgs)
    import System.IO

    {-----------
    TYPES
    -----------}
    type Token = String
    type Tag = String
    type TokenTag = (Token, Tag)

    {-------------------
    POS MODEL TRAINER
    -------------------}
    -- split the tokens and the tags
    toTokenTag :: Char -> String -> TokenTag
    toTokenTag separator string =
      let (token, tag, _) = rsplit_ separator string
      in (token, tag)

    -- Split a list at the last occurrence of the separator. The Bool records
    -- whether the separator has been seen yet (folding from the right).
    rsplit_ :: (Eq a) => a -> [a] -> ([a], [a], Bool)
    rsplit_ separator = foldr (splitBool separator) ([],[],False)
      where
        splitBool separator letter (token, tag, True) = (letter:token, tag, True)
        splitBool separator letter (token, tag, False)
          | letter == separator = (token, tag, True)
          | otherwise = (token, letter:tag, False)

    -- Calculate the frequency of tags for a token
    tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
    tokenTagFreqs = L.foldl' countWord M.empty
      where
        -- New token: store a singleton map. Known token: update its tag map.
        countWord map1 (token, tag) = M.insertWith (countTag tag) token (M.singleton tag 1) map1
        -- Ignore the new singleton and bump the tag's count in the existing map.
        countTag tag _ map2 = M.insertWith (\newFreq oldFreq -> oldFreq + newFreq) tag 1 map2
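For a quick check in ghci, the two functions can be tried on a tiny hand-made corpus. The sketch below restates them in compact form so it loads on its own; `trainLine` and `sample` are hypothetical names I added for the test, and the "token/tag" input format with `/` as separator is an assumption.

```haskell
import qualified Data.List as L
import qualified Data.Map as M

type Token = String
type Tag = String
type TokenTag = (Token, Tag)

-- Split "token/tag" at the LAST separator, folding from the right.
toTokenTag :: Char -> String -> TokenTag
toTokenTag separator string = (token, tag)
  where
    (token, tag, _) = foldr step ([], [], False) string
    step letter (tok, tg, True) = (letter : tok, tg, True)
    step letter (tok, tg, False)
      | letter == separator = (tok, tg, True)
      | otherwise = (tok, letter : tg, False)

-- Count how often each tag occurs with each token.
tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
tokenTagFreqs = L.foldl' countWord M.empty
  where
    countWord acc (token, tag) =
      M.insertWith (\_ old -> M.insertWith (+) tag 1 old)
                   token (M.singleton tag 1) acc

-- Hypothetical helper: build the frequency map from one line of tagged text.
trainLine :: String -> M.Map Token (M.Map Tag Int)
trainLine = tokenTagFreqs . map (toTokenTag '/') . words

-- Hypothetical sample input, assuming whitespace-separated "token/tag" pairs.
sample :: String
sample = "the/DT can/MD can/NN can/MD"
```

With this loaded, `trainLine sample` maps "can" to MD twice and NN once, and "the" to DT once. Because the split happens at the last separator, a token that itself contains a slash survives: `toTokenTag '/' "1/2/CD"` gives `("1/2","CD")`. A string without any separator ends up entirely in the tag position, with an empty token.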