In this post, we will:

  • present the last two functions for our tagger
  • wrap the functions in some IO paper
  • go for a test drive.

Finishing touches

The first function is simply a composition of tokenMostFreqTag and tokenTagFreqs to make our life a bit easier:

trainFreqTagger :: [TokenTag] -> M.Map Token Tag 
trainFreqTagger = tokenMostFreqTag . tokenTagFreqs
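
To make the two stages of this composition visible, here is a minimal sketch that trains on a tiny in-memory corpus. The function bodies follow the ones used in this series, written slightly more compactly. The names toyCorpus and toyModel and the toy data are assumptions made up for illustration; they are not part of the Brown corpus.

```haskell
import qualified Data.Map as M
import qualified Data.List as L

type Token = String
type Tag = String
data TokenTag = TokenTag Token Tag
   deriving Show

-- Stage 1: for every token, count how often each tag occurs with it.
tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int)
tokenTagFreqs = L.foldl' countWord M.empty
  where
    countWord freqs (TokenTag token tag) =
      M.insertWith (\_ old -> M.insertWith (+) tag 1 old)
                   token (M.singleton tag 1) freqs

-- Stage 2: keep only the most frequent tag of every token.
tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag
tokenMostFreqTag = M.map (fst . M.foldlWithKey findMax ("NIL", 0))
  where
    findMax acc@(_, maxFreq) tag freq
      | freq > maxFreq = (tag, freq)
      | otherwise      = acc

-- The composition: stage 1 runs first, stage 2 consumes its result.
trainFreqTagger :: [TokenTag] -> M.Map Token Tag
trainFreqTagger = tokenMostFreqTag . tokenTagFreqs

-- Hypothetical toy corpus: "can" is seen twice as MD and once as NN.
toyCorpus :: [TokenTag]
toyCorpus =
  [ TokenTag "the" "AT"
  , TokenTag "can" "MD"
  , TokenTag "can" "NN"
  , TokenTag "can" "MD"
  ]

-- Hypothetical model trained on the toy corpus.
toyModel :: M.Map Token Tag
toyModel = trainFreqTagger toyCorpus
```

In GHCi, M.toList toyModel evaluates to [("can","MD"),("the","AT")]: the less frequent NN reading of "can" is discarded in the second stage.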

The second function is a bit problematic. At www.nlpwp.org, they used the following function:

freqTagWord :: M.Map Token Tag -> Token -> Maybe Tag 
freqTagWord m t = M.lookup w t

It’s basically an alias for the Map.lookup function, which looks up a token and returns its tag, or Nothing if the token is not in the map. However, the function above will produce 2 compile errors:

  1. Not in scope: `w': the variable w is never bound on the left-hand side of the equation (the parameters are called m and t), so GHC considers it out of scope.
  2. Type mismatch: even if we correct the variable name, there will be an error, because the arguments are passed in the wrong order. Our function takes the map first and the token second, so it has the shape freqTagWord :: Ord k => M.Map k a -> k -> Maybe a. M.lookup takes the key first and the map second: lookup :: Ord k => k -> Map k a -> Maybe a.

So the correct function would be:

freqTagWord :: M.Map Token Tag -> Token -> Maybe Tag 
freqTagWord myMap token = M.lookup token myMap
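
As a small illustration of the argument order, the sketch below puts the corrected function next to an equivalent spelling that uses flip. The name freqTagWord' and the hand-built toyModel are assumptions introduced only for this example.

```haskell
import qualified Data.Map as M

type Token = String
type Tag = String

-- M.lookup expects the key first and the map second, so the two
-- arguments are swapped relative to freqTagWord's own parameter order.
freqTagWord :: M.Map Token Tag -> Token -> Maybe Tag
freqTagWord myMap token = M.lookup token myMap

-- Equivalent definition (hypothetical name): flip swaps the two
-- arguments of M.lookup for us.
freqTagWord' :: M.Map Token Tag -> Token -> Maybe Tag
freqTagWord' = flip M.lookup

-- Hypothetical hand-built model, standing in for a trained one.
toyModel :: M.Map Token Tag
toyModel = M.fromList [("the", "AT"), ("cat", "NN")]
```

With this model, freqTagWord toyModel "cat" gives Just "NN", while a token that was never seen, such as "dog", gives Nothing.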

Wrap it!

In the following code we throw all the work from the previous posts together and wrap it in some ugly IO stuff:

import qualified Data.Map as M
import qualified Data.List as L
import System.Environment (getArgs)
import System.IO

type Token = String 
type Tag = String
data TokenTag = TokenTag Token Tag
   deriving Show


main = do
     (command:args) <- getArgs
     case command of
        "taggart" -> taggart args
        _ -> putStrLn "Error: command does not exist\nPossible commands are:\n-taggart filename string\n"

taggart :: [String] -> IO()
taggart [trainName, string] = do
     fileHandle <- openFile trainName ReadMode
     contents <- hGetContents fileHandle
     let model = trainFreqTagger $ map toTokenTag $ words contents
     let alpha = words string
     let tagged = map (freqTagWord model) alpha
     let result = zip alpha tagged 
     putStrLn (show result)
     hClose fileHandle
taggart _ = do
     putStrLn "Error: the taggart command requires two arguments: taggart filename expression"

rsplit :: Eq a => a -> [a] -> ([a], [a])
rsplit separator alpha = let (ps, xs, _) = rsplit_ separator alpha
                         in (ps, xs)

rsplit_ :: Eq a => a -> [a] -> ([a],[a],Bool)
rsplit_ separator = foldr (splitBool separator) ([],[],False)
     where
       splitBool separator letter (token, tag, True) = (letter:token, tag, True)
       splitBool separator letter (token, tag, False) | letter == separator = (token, tag, True)
                                                      | otherwise = (token, letter:tag, False)

toTokenTag :: String -> TokenTag
toTokenTag s = let (token, tag) = rsplit '/' s
               in TokenTag token tag

tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int) 
tokenTagFreqs = L.foldl' countWord M.empty
   where 
      countWord map1 (TokenTag token tag) =
                 M.insertWith (countTag tag) token (M.singleton tag 1) map1
      countTag tag _ map2 = M.insertWith (\newFreq oldFreq -> oldFreq + newFreq) tag 1 map2

tokenMostFreqTag :: M.Map Token (M.Map Tag Int) -> M.Map Token Tag 
tokenMostFreqTag = M.map (fst . M.foldlWithKey findMax ("NIL", 0))
   where findMax acc@(_, maxFreq) tag freq | freq > maxFreq = (tag, freq)
                                           | otherwise = acc

trainFreqTagger :: [TokenTag] -> M.Map Token Tag 
trainFreqTagger = tokenMostFreqTag . tokenTagFreqs

freqTagWord :: M.Map Token Tag -> Token -> Maybe Tag 
freqTagWord myMap token = M.lookup token myMap
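
One detail of the listing that is easy to miss is that rsplit splits at the last separator, not the first, because the fold runs from the right. The sketch below restates the same fold in a single function and applies it to three inputs. The input strings are made-up examples, and the name examples is an assumption for illustration only.

```haskell
-- Split a list at the LAST occurrence of the separator.
-- The Bool in the accumulator records whether the separator
-- has already been seen while folding from the right.
rsplit :: Eq a => a -> [a] -> ([a], [a])
rsplit separator xs = (before, after)
  where
    (before, after, _) = foldr step ([], [], False) xs
    -- Separator already seen: everything else belongs to the token.
    step x (tok, tag, True) = (x : tok, tag, True)
    -- Separator not seen yet: keep collecting the tag.
    step x (tok, tag, False)
      | x == separator = (tok, tag, True)
      | otherwise      = (tok, x : tag, False)

-- Hypothetical inputs: a plain pair, a token that itself contains
-- a slash, and a string without any separator.
examples :: [(String, String)]
examples = map (rsplit '/') ["cat/NN", "1/2/CD", "cat"]
```

Here examples evaluates to [("cat","NN"),("1/2","CD"),("","cat")]. Note the last case: a string without a slash ends up entirely in the tag position, with an empty token.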

Run it!

>./morse taggart brown-pos-train.txt "The cat is on the mat ."
[("The",Just "AT"),("cat",Just "NN"),("is",Just "BEZ"),("on",Just "IN"),("the",Just "AT"),("mat",Just "NN"),(".",Just ".")]
>./morse taggart brown-pos-train.txt "The vice-president presided over the president's press conference ."
[("The",Just "AT"),("vice-president",Just "NN"),("presided",Just "VBD"),("over",Just "IN"),("the",Just "AT"),("president's",Just "NN$"),("press",Just "NN"),("conference",Just "NN"),(".",Just ".")]

Ladies and gentlemen… We have ourselves a tagger… Yeeha! We will try to improve it in the next post.