It’s been a long trip, but here we are. Back at the beginning. The first, the last, my frequency.

The first few weeks of this blog have basically been a laboratory and playground. Although it is not very hard to find the frequency of an n-gram in a corpus, we took some detours following the scenic routes of suffix arrays, the Brown Corpus, Vectors, Maps and binary search. In the end, I just decided to find a readable and fast way to calculate the frequency of (a set of) words. That is the subject of this post.

Bug: nGrams

All this time, I’ve been using the following function to create a list of n-grams:

nGrams :: Int -> [a] -> [[a]]
nGrams n [] = []
nGrams n xs = take n xs :  nGrams n (tail xs)

The function does create a list of n-grams, but the last few elements are not correct:

*Main> nGrams 3 [1..6]
[[1,2,3],[2,3,4],[3,4,5],[4,5,6],[5,6],[6]]

As you can see, the last 2 elements are not 3-grams. Maybe we could check the length of the remaining list? Once it dives under n, we simply stop the function. That’s exactly what they did at nlpwp.org:

nGrams :: Int -> [a] -> [[a]] 
nGrams 0 _ = [] 
nGrams _ [] = [] 
nGrams n xs
   | length xs >= n = take n xs : nGrams n (tail xs)
   | otherwise      = []

The function works:

*Main> nGrams 3 [1..6]
[[1,2,3],[2,3,4],[3,4,5],[4,5,6]]

The only problem is that it becomes incredibly slow once you work with the Brown Corpus. length is a linear function: for a list of 1 million elements it has to walk all 1,000,000 elements at the first position, 999,999 at the second position, and so on, which makes the whole n-gram pass quadratic.
One could work with streams instead, but I’m too much of a noob for that. So we keep the original nGrams and simply drop the last (n - 1) elements from its result:

*Main> let list = [1..6]
*Main> let n = 3
*Main> take (length list - (n - 1)) $ nGrams n list
[[1,2,3],[2,3,4],[3,4,5],[4,5,6]]

The single length call here is just one linear pass over the list, so this does not noticeably slow down the function (or the application).
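For completeness, here is a sketch of a variant that avoids the quadratic check without the outer take: it only measures the first n elements of each suffix, so every step does at most n units of work. The name nGrams' is mine, not part of the original code.

-- Only inspect the first n elements of each suffix instead of
-- measuring the length of the whole remaining list.
nGrams' :: Int -> [a] -> [[a]]
nGrams' 0 _  = []
nGrams' _ [] = []
nGrams' n xs
   | length gram == n = gram : nGrams' n (tail xs)
   | otherwise        = []
   where gram = take n xs

On the example above, nGrams' 3 [1..6] gives the same four 3-grams as the take trick.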

Frequency Map

A Map is an easy and nice way to reveal the frequency of all the elements of a corpus:

  • create an empty Map
  • iterate through the corpus we want the frequency list of
  • for each element of the corpus: check whether it’s in the Map
    • if it isn’t: add it with a frequency of 1
    • if it is: increase its frequency by 1

And that’s that. In Haskell, this results in the following function (nList is a list of n-grams, see code below):

frequency_map = L.foldl' findFreq M.empty nList

findFreq :: M.Map [String] Int -> [String] -> M.Map [String] Int
findFreq fMap x
   | M.member x fMap = M.adjust (+1) x fMap
   | otherwise       = M.insert x 1 fMap
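As an aside, Data.Map also offers insertWith, which folds the membership test and the update into a single operation, so findFreq can be written in one line. A minimal sketch, equivalent to the version above:

-- Insert with count 1 if the key is new, otherwise add 1 to the existing count.
findFreq' :: M.Map [String] Int -> [String] -> M.Map [String] Int
findFreq' fMap x = M.insertWith (+) x 1 fMap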

After that, we just convert the Map back to a list and we can do whatever we want with it. In the full code below I simply took the highest frequency, but one could just as well look for the lowest, the ten highest (sketched below), or anything else you can do with a list.
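For instance, a sketch of the "ten highest" variant (the name tenHighest is mine; it assumes a frequency Map built with findFreq as above):

-- Sort the (n-gram, count) pairs by descending count and keep the first ten.
tenHighest :: M.Map [String] Int -> [([String], Int)]
tenHighest fMap = take 10 $ L.sortBy (\a b -> compare (snd b) (snd a)) (M.toList fMap)

The full program follows.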

import qualified Data.Map as M
import qualified Data.List as L
import System.Environment (getArgs)
import System.IO
import Data.Char (toLower, toUpper)

main = do
	(command:args) <- getArgs
	case command of
		"freq" -> freq args
		"high" -> high args 
		_ -> putStrLn "Error: command does not existnPossible commands are:n↵
freq string filenamen"

{- freq: calculate the frequency of a word or n-gram in the corpus -}
freq :: [String] -> IO()
freq [string, fileName] = do
	fileHandle <- openFile fileName ReadMode
	contents <- hGetContents fileHandle
	let corpus = prepare contents
	let alpha = words string
	let n = length alpha		
	let nList  = take (length corpus - (n - 1)) $ nGrams n corpus
	putStrLn ("Frequency "" ++ string ++ "": " ++ (show $ frequency alpha nList))
	hClose fileHandle
freq _ = do
	putStrLn "Error: the freq command requires two arguments: freq word filename"

-- high: find the most frequent n-gram in the corpus
high :: [String] -> IO()
high [number, fileName] = do
   fileHandle <- openFile fileName ReadMode
   contents <- hGetContents fileHandle
   let corpus = prepare contents
   let n = read number::Int		
   let nList = take (length corpus - (n - 1)) $ nGrams n corpus
   let highest = L.last $ L.sort $ L.map (\(x,y) -> (y,x))
                   (M.toList $ L.foldl' findFreq M.empty nList)
   putStrLn ((unwords $ snd $ highest) ++ ": " ++ (show $ fst highest) ++ " times")
   hClose fileHandle
high _ = do
   putStrLn "Error: the high command requires two arguments: high Int filename"

--frequency of a word alpha in the corpus contents
frequency :: (Eq a) => a -> [a] -> Int
frequency alpha contents = length . filter (==alpha) $ contents

--create a list of n-grams
nGrams :: Int -> [a] -> [[a]]
nGrams n [] = []
nGrams n xs = take n xs :  nGrams n (tail xs)

--convert a list to a Map indexed by position (not used by freq or high)
createMap :: (Ord a) => [[a]] -> M.Map Int [a]
createMap list = M.fromList (zip index list)
			where index = [0..(length list - 1)]

--clean a String: replace punctuation | map to lowercase | convert to a list of words
prepare :: String -> [String]
prepare = words . map toLower . replace'

replace' :: [Char] -> [Char]
replace' [] = []
replace' (x:xs) = case x of
				'.' -> " _." ++ replace' xs
				',' -> " _," ++ replace' xs
				';' -> " _;" ++ replace' xs
				'"' -> " _"" ++ replace' xs
				'!' -> " _!" ++ replace' xs
				':' -> " _:" ++ replace' xs
				'?' -> " _?" ++ replace' xs
				_ -> x : replace' xs

findFreq :: M.Map [String] Int -> [String] -> M.Map [String] Int
findFreq fMap x
   | M.member x fMap = M.adjust (+1) x fMap
   | otherwise       = M.insert x 1 fMap

It works:

time ./final_brown high 3 brown.txt
_, and the: 651 times

real 0m25.909s
user 0m25.336s
sys 0m0.555s

I could take a closer look at the speed, but for now I will move on in the journey of Haskell and linguistics. The next subject is classification.