The seventh chapter in nlpwp.org concerns part of speech tagging. What’s that? Well, imagine you have a corpus and you want to add tags to its words (or ‘tokens’) explaining their function. “The” is an article, “president” is a noun, “goes” is a verb etc. You could do it all by hand, but you could also write a program to do it for you.
That’s what we’re going to do in the next few posts. The people at nlpwp.org use the following method:
Analyse a tagged corpus and build a model:
- Find a corpus A that has already been tagged.
- Build a list of the tokens and the tags associated with them.
- Build a list containing pairs of tokens and their most frequent tag.
Find another tagged corpus B to test the model against:
- Use the model to predict tags for the tokens in corpus B.
- Since corpus B is already tagged, one can check whether the model predicts this correctly.
Improve the model:
- There will certainly be tokens in corpus B that are not in the model. They all receive the “noun” tag because most unknown tokens will have that function.
- Use transformational rules to get better results.
I hope that I am explaining this correctly. If not: shout!
I want a divorce!
For the tagger, we will use the tagged Brown Corpus. It is the same as the corpus we used before, only that it’s been tagged (ain’t that a surprise). If you look at the first sentence, you’ll probably see a pattern emerging:
The/AT Fulton/NP County/NN Grand/JJ Jury/NN said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta’s/NP$ recent/JJ primary/NN election/NN produced/VBD “/“ no/AT evidence/NN ”/” that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
That’s right! The token has been separated from the tag by a forward slash. Here are three functions that truly split them:
type Token = String type Tag = String data TokenTag = TokenTag Token Tag deriving Show toTokenTag :: String -> TokenTag toTokenTag s = let (token, tag) = rsplit '/' s in TokenTag token tag rsplit :: Eq a => a -> [a] -> ([a], [a]) rsplit separator alpha = let (ps, xs, _) = rsplit_ separator alpha in (ps, xs) rsplit_ :: Eq a => a -> [a] -> ([a],[a],Bool) rsplit_ separator = foldr (splitBool separator) ([],[],False) where splitBool separator letter (token, tag, True) = (letter:token, tag, True) splitBool separator letter (token, tag, False) | letter == separator = (token, tag, True) | otherwise = (token, letter:tag, False)
The function rsplit_
runs through a token/tag one letter at the the time starting from the right:
- If it hasn’t met the separator (/) yet, it adds the letter to the part reserved for the tag.
- When it meets the separator, it raises the flag that the separator has been met.
- Any letter met before that is added to the token part.
The function rsplit
simply pretty-prints the result without the flag. We wrap the result in our very own TokenTag
type, consisting of two strings. It does something like this:
>map toTokenTag $ words “zebra/red elephant/blue lion/green”
[TokenTag “zebra” “red”,TokenTag “elephant” “blue”,TokenTag “lion” “green”]
So you want to be a supermodel
If we use this on the Brown Corpus, we will get a list of tokens and their tag. Now we will order this list according to the tokens, their tags and the tags’ frequency. We will nest the Map type for this:
- We run through the list of TokenTags.
- We take the token and check if it’s already in our first Map:
- No? Let’s add it! The token will be the key of the map. The value is another Map. We add the tag to this second Map with a frequency of 1.
- Yes? Wait. Let’s check if the tag is already in the second Map:
- No? We add it with a frequency of 1.
- Yes? We add 1 to its frequency.
That must take a lot of code. Nope. Long live Haskell:
import qualified Data.Map as M import qualified Data.List as L tokenTagFreqs :: [TokenTag] -> M.Map Token (M.Map Tag Int) tokenTagFreqs = L.foldl' countWord M.empty where countWord map1 (TokenTag token tag) = ↵ M.insertWith (countTag tag) token (M.singleton tag 1) map1 countTag tag _ map2 = M.insertWith ( newFreq oldFreq -> oldFreq + newFreq) tag 1 map2
The key to understanding this function is meditation, right? Erm. No. It’s M.insertWith
:
- It takes a key, a value and a function and tries to insert the key and the value into the Map.
- If the key already exists, the function is called,
countTag
in this particular case. countTag
is a disguisedM.insertWith
and tries to insert the key (tag) and value (1) into the second Map.- If the key already exists, it calls a lambda abstraction that adds the value (1) to the the frequency.
Elegant, isn’t it?
>tokenTagFreqs $ map toTokenTag $ words “zebra/red elephant/blue lion/green zebra/red zebra/green”
fromList [(“elephant”,fromList [(“blue”,1)]),(“lion”,fromList [(“green”,1)]),(“zebra”,fromList [(“green”,1),(“red”,2)])]
There is one last step before our supermodel becomes a true star: find the most frequent tag for each token. That’s something for the next post! Don’t you just love cliffhangers? :-D