The seventh chapter in concerns part of speech tagging. What’s that? Well, imagine you have a corpus and you want to add tags to its words (or ‘tokens’) explaining their function. “The” is an article, “president” is a noun, “goes” is a verb etc. You could do it all by hand, but you could also write a program to do it for you.

That’s what we’re going to do in the next few posts. The people at use the following method:

Analyse a tagged corpus and build a model:

  • Find a corpus A that has already been tagged.
  • Build a list of the tokens and the tags associated with them.
  • Build a list containing pairs of tokens and their most frequent tag.

Find another tagged corpus B to test the model against:

  • Use the model to predict tags for the tokens in corpus B.
  • Since corpus B is already tagged, one can check whether the model predicts this correctly.

Improve the model:

  • There will certainly be tokens in corpus B that are not in the model. They all receive the “noun” tag because most unknown tokens will have that function.
  • Use transformational rules to get better results.

I hope that I am explaining this correctly. If not: shout!

I want a divorce!

For the tagger, we will use the tagged Brown Corpus. It is the same as the corpus we used before, only that it’s been tagged (ain’t that a surprise). If you look at the first sentence, you’ll probably see a pattern emerging:

The/AT Fulton/NP County/NN Grand/JJ Jury/NN said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta’s/NP$ recent/JJ primary/NN election/NN produced/VBD “/“ no/AT evidence/NN ”/” that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

That’s right! The token has been separated from the tag by a forward slash. Here are three functions that truly split them:

type Token = String 
type Tag = String
data TokenTag = TokenTag Token Tag
	deriving Show

toTokenTag :: String -> TokenTag
toTokenTag s = 	let (token, tag) = rsplit '/' s 
				in TokenTag token tag

rsplit :: Eq a => a -> [a] -> ([a], [a])
rsplit separator alpha = let (ps, xs, _) = rsplit_ separator alpha
                         in (ps, xs)
rsplit_ :: Eq a => a -> [a] -> ([a],[a],Bool)
rsplit_ separator = foldr (splitBool separator) ([],[],False)
     splitBool separator letter (token, tag, True) = (letter:token, tag, True)
     splitBool separator letter (token, tag, False) | letter == separator = (token, tag, True)
					            | otherwise = (token, letter:tag, False)

The function rsplit_ runs through a token/tag one letter at the the time starting from the right:

  • If it hasn’t met the separator (/) yet, it adds the letter to the part reserved for the tag.
  • When it meets the separator, it raises the flag that the separator has been met.
  • Any letter met before that is added to the token part.

The function rsplit simply pretty-prints the result without the flag. We wrap the result in our very own TokenTag type, consisting of two strings. It does something like this:

>map toTokenTag $ words “zebra/red elephant/blue lion/green”
[TokenTag “zebra” “red”,TokenTag “elephant” “blue”,TokenTag “lion” “green”]

So you want to be a supermodel

If we use this on the Brown Corpus, we will get a list of tokens and their tag. Now we will order this list according to the tokens, their tags and the tags’ frequency. We will nest the Map type for this:

  • We run through the list of TokenTags.
  • We take the token and check if it’s already in our first Map:
    • No? Let’s add it! The token will be the key of the map. The value is another Map. We add the tag to this second Map with a frequency of 1.
    • Yes? Wait. Let’s check if the tag is already in the second Map:
      • No? We add it with a frequency of 1.
      • Yes? We add 1 to its frequency.

That must take a lot of code. Nope. Long live Haskell:

import qualified Data.Map as M
import qualified Data.List as L

tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int) 
tokenTagFreqs = L.foldl' countWord M.empty
      countWord map1 (TokenTag token tag) = ↵
                 M.insertWith (countTag tag) token (M.singleton tag 1) map1 
      countTag tag _ map2 = M.insertWith ( newFreq oldFreq -> oldFreq + newFreq) tag 1 map2

The key to understanding this function is meditation, right? Erm. No. It’s M.insertWith:

  • It takes a key, a value and a function and tries to insert the key and the value into the Map.
  • If the key already exists, the function is called, countTag in this particular case.
  • countTag is a disguised M.insertWith and tries to insert the key (tag) and value (1) into the second Map.
  • If the key already exists, it calls a lambda abstraction that adds the value (1) to the the frequency.

Elegant, isn’t it?

>tokenTagFreqs $ map toTokenTag $ words “zebra/red elephant/blue lion/green zebra/red zebra/green”
fromList [(“elephant”,fromList [(“blue”,1)]),(“lion”,fromList [(“green”,1)]),(“zebra”,fromList [(“green”,1),(“red”,2)])]

There is one last step before our supermodel becomes a true star: find the most frequent tag for each token. That’s something for the next post! Don’t you just love cliffhangers? :-D