I haven’t posted something for a while, but I have been programming and rewriting parts of the posTagger from www.nlpwp.org:
- the tagger now works with three different files (training, testing, evaluation)
- it is trained based on the frequency of tags
- and adds the most common tag (“NN”).
Next step is the implementation of transformational rules, but I have to dive into the Data.List.Zipper package first. There is also a small difference in statistics from the nlpwp.org version that I have to check.
- import qualified Data.Map as M
- import qualified Data.List as L
- import Data.List.Zipper as Z
- import System.Environment (getArgs)
-
- {-----------
- DATA TYPES
- -----------}
- type TokenTag = (Token, Tag)
-
- {----------------
- IO: dealing with the outside world
- -----------------}
- main = do
- (command:args) <- getArgs
- case command of
- "tag" -> tag args
-
- tag [trainName, testName, evalName] = do
- -- open files and retrieve contents
- trainFile <- openFile trainName ReadMode
- trainContents <- hGetContents trainFile
- testFile <- openFile testName ReadMode
- testContents <- hGetContents testFile
- evalFile <- openFile evalName ReadMode
- evalContents <- hGetContents evalFile
- -- test frequency tagging and print results
- putStrLn "***Frequency tagging***"
- let statFreqTags = compareMaybe evalTokenTag freqTags [0,0,0]
- prettyPrint statFreqTags
- -- test frequency tagging + base tagging and print results
- putStrLn "***Base tagging***"
- let baseTags = baseTest freqTags "NN"
- let statBaseTags = compareTag evalTokenTag baseTags [0,0]
- prettyPrint statBaseTags
- hClose testFile
- hClose evalFile
- test _ = do
- putStrLn "Error: the tag command requires three arguments: tag trainName testName evalName"
-
- {-------------------
- MODEL TRAINER
- -------------------}
- -- split the tokens and the tags
- toTokenTag separator string =
- let (token, tag, _) = rsplit_ separator string
- in (token, tag)
-
- where
- splitBool separator letter (token, tag, True) = (letter:token, tag, True)
- splitBool separator letter (token, tag, False) | letter == separator = (token, tag, True)
- -- Step 1: Frequency
- -- Calculate the frequency of tags for a token
- where
- countWord map1 (token, tag) = M.insertWith (countTag tag) token (M.singleton tag 1) map1
- countTag tag _ map2 = M.insertWith ( newFreq oldFreq -> oldFreq + newFreq) tag 1 map2
-
- -- Find the most frequent tag for a token
- where findMax acc@(_, maxFreq) tag freq
- | freq > maxFreq = (tag, freq)
-
- -- Step 2: add default tag
- -- add default tag to all tokens without a tag (probably"NN")
- defaultTag baseTag (token, Nothing) = (token, baseTag)
- defaultTag baseTag (token, Just tag) = (token, tag)
-
- {-------------
- MODEL TESTER
- --------------}
-
-
-
- {-------------
- MODEL EVALUATOR
- ---------------}
- compareTag [] [] [number, correct] = [number, correct]
- compareTag _ [] [number, correct] = [number, correct]
- compareTag [] _ [number, correct] = [number, correct]
- compareTag (x:xs) (y:ys) [number, correct] = compareTag xs ys [(number + 1), (correct + (eval x y))]
- where eval tagOriginal tagComputed | tagOriginal == tagComputed = 1
-
- compareMaybe [] [] [number, correct, empty] = [number, correct, empty]
- compareMaybe _ [] [number, correct, empty] = [number, correct, empty]
- compareMaybe [] _ [number, correct, empty] = [number, correct, empty]
- compareMaybe (x:xs) (y:ys) [number, correct, empty] =
- compareMaybe xs ys [(number + 1), (correct + (evalJust x y)), (empty + (evalNothing x y))]
- where
- evalJust (_,tagOriginal) (_,tagComputed) | tagOriginal == (DM.fromMaybe "NIHIL" tagComputed) = 1
- evalNothing (_, tagOriginal) (_,tagComputed) | tagComputed == Nothing = 1
-
- prettyPrint [number, correct, empty] = do
- prettyPrint [number, correct] = do
-
When you run this in the Terminal, it goes a little something like this:
koenroelandt$ ./posTagger tag train.txt test.txt eval.txt
***Frequency tagging***
Total tagged: 95.773186 %
Total tagged correctly: 87.66148 %
Total empty: 4.226818 %
Total tagged known: 91.530304 %
***Base tagging***
Total tagged correctly: 88.52606 %