Using HaXml to clean legacy HTML pages
This page explains how HaXml could be used to clean legacy HTML pages:
1. The Introduction
HaXml is a collection of utilities for parsing, filtering, transforming, and generating XML documents using Haskell.
It is Malcolm Wallace's creation; a full feature list, tutorials and downloads can be found at the HaXml page.
When I was confronted with a website containing legacy (FrontPage and even weirder stuff) HTML, I started thinking about methods to clean these pages automatically. HTML Tidy is a helpful tool, but I needed more control over substitutions. Moreover, the pages contained a layout table that Tidy could not remove. A consultant had already made a perl script that cut out most of the layout tables, but not all. At this time, I decided that all the tables were to be removed, including those containing data. The latter were also dirty, with font tags and presentational attributes all over the place. They would have to be remade by hand or put into dynamic web pages, since accessibility should be high on the agenda.
I was still looking for a tool that provided powerful yet simple substitutions and filters. Kris De Schutter brought HaXml to my attention and after reading the paper by Malcolm Wallace and Colin Runciman, this small project was born. Before starting off with HaXml, you obviously need a background in Haskell. I used the book Haskell: The Craft of Functional Programming (second edition) by Simon Thompson, which is quite good for a starter. I still do not grasp all the consequences and advantages of functional programming, but reading the book is sufficient for working with HaXml. Moreover, when you install HaXml, library documentation is created automatically. The Haskell Cafe mailing list also provides helpful information.
One problem, however, was that HaXml needed XML as input. I used HTML Tidy to convert the dirty HTML pages to well-formed XML pages. After that, some fine-tuning was necessary, for which I used Perl substitutions. O'Reilly's 'Learning Perl, Fourth Edition' was an extremely helpful introduction.
!Important! The filters and substitutions I use have been developed for a particular website containing particular problems. Moreover, the markup I wanted to end up with is likely different from your needs. It is doubtful that you can simply copy/paste all the files and then clean your site. In this text, I simply want to point out that using HaXml for cleaning legacy sites has its merits. This method has not been used in production (yet).
2. The Example
Setting it up
Setting up the environment for our project is quite straightforward:
- Download ActivePerl if you are working with Windows. I used version 5.8.8 (build 817), but I guess it should work with more recent versions. By default, the application is installed in C:\perl and the path is added to your environment variables.
- Create the folder C:\executables. This is the folder where all executables (.exe, perl etc.) will be placed.
- Download tidy.exe and place it in the folder C:\executables.
- Download my HaXml zip archive and unpack it. Place all the files in the c:\executables folder.
If you want to work in the HaXml file itself and compile it, two more steps are necessary:
- Download the Glasgow Haskell Compiler. I used version 6.4 for the project and it was installed in C:\ghc on my computer.
- Download HaXml. I used version 1.13.2 and installed it in C:\haskell.
Trying it out
In order to test the files, you should create a folder with some dirty HTML files, enter it in the command prompt and then run the command c:\executables\runclean.bat. The file will create five folders, one for each state of the cleaning. This makes debugging and understanding the files easier. The cleaned file can be found in the folder perl2.
I use two kinds of dirty pages for this tutorial; they also look scruffy because I couldn't be bothered to write a decent CSS file. The dirty homepage is a simple page that reflects the junk FrontPage adds when building so-called HTML pages. When validated at W3C, it yields 184 errors. The XHTML page cleaned with HaXml, however, contains valid XHTML. The content looks ugly, but a computer cannot add structure to a file. It takes a human being to decide on the titles, lists etc. I did this in a few minutes' time and changed the doctype to strict. The result validates and looks like the original. The file size has shrunk from 10k in the original to a mere 4k in the strict file.
The tool has also been tested on pages generated by HTML Transit, which were then vandalised by inserting a layout table and editing the text in FrontPage. The result is quite awful, with 128 errors when validated. HaXml was used again to get rid of the mess and the cleaned page validates without a problem in the transitional doctype. After adding some structure and converting the doctype to strict, our messy Cinderella page became the prom queen (well, the HTML did). It lost weight, too: from 87k to 65k.
HaXml was tested on more pages, of course, some two hundred. Although not all pages validated immediately, errors were limited.
3. The Batch File
The batch file creates the five folders and calls the necessary applications. For more information you can take a look at the (commented) file. The order followed in this tutorial is the order of the files called in the batch file:
@echo off
rem This is a batch file for cleaning legacy HTML pages (<Frontpage)
rem The result is transitional xhtml
rem It consists of the following methods:
rem 1) HTML Tidy (http://tidy.sourceforge.net)
rem 2) perl (http://www.perl.org)
rem 3) HaXml (http://www.cs.york.ac.uk/fp/HaXml), based on haskell (http://www.haskell.org)
rem Made by Koen Roelandt on 23/11/2006
rem create directories. If the program is going in production, copying the files to different folders can be skipped
mkdir perl1
mkdir tidy1
mkdir haskell
mkdir tidy2
mkdir perl2
rem copy files to directory perl1, enter it, and execute purpletext.pl
rem purpletext.pl deletes webbots and cleans the font tag. See file for more information
copy *.htm perl1
cd perl1
for %%i in (*.htm) do perl c:\executables\purpletext.pl %%i
echo perl1 finished
rem copy files to directory tidy1, enter it, and execute tidy
rem HTML Tidy cleans web pages. The options are found in the config file. It prepares the HTML for HaXml
copy *.htm ..\tidy1
cd ..\tidy1
for %%i in (*.htm) do c:\executables\tidy.exe -config c:\executables\config.txt %%i
echo tidy1 finished
rem rename all files' extension to .xml, which is necessary for the HaXml exe
ren *.htm *.xml
rem copy files to directory haskell, enter it, execute HaXml (main.exe) and execute entity_spaces.pl
rem The source file for main.exe is Main.hs, which can be found at C:\haskell\HaXml-1.13.2\HaXml-1.13.2\src
rem See the documentation there for more information on Haskell, HaXml and compilation.
copy *.xml ..\haskell
cd ..\haskell
for %%i in (*.xml) do c:\executables\main.exe %%i %%~ni.htm
rem HaXml has a problem with spaces around entities (e.g. "&"). These problems are solved in whiteSpaces.pl
for %%i in (*.htm) do perl -0777 c:\executables\whiteSpaces.pl %%i
echo haskell finished
rem copy files to directory tidy2, enter it, and execute tidy
rem clean the file again, especially to make it user-friendly
copy *.htm ..\tidy2
cd ..\tidy2
for %%i in (*.htm) do c:\executables\tidy.exe -config c:\executables\config2.txt %%i
echo tidy2 finished
rem copy files to directory perl2, enter it, execute list.pl and execute webbot.pl
rem lists.pl adds the finishing touches, see the file for more info.
rem webbots.pl adds the menu header to the file
copy *.htm ..\perl2
cd ..\perl2
for %%i in (*.htm) do perl -0777 c:\executables\lists.pl %%i
echo perl2 finished
cd ..
rem end of batch
4. purpletext.pl
The perl files are used for handling the details, and they consist mostly of simple substitutions. The file purpletext.pl can be omitted since it deals with minor problems:
- At one time, a tag <fontsize="5"> was found in the markup and Haskell choked on it (cf. infra). Purpletext.pl corrects this.
- The file also removes the "purpletext" webbot (hence its name). A webbot is a mysterious robot added by FrontPage for setting the date on a page, including other pages, inserting a counter etc. The purpletext webbot shows some meta information about the selected font tag, namely whether it is an <h1> or <h2>. Since this information is invaluable, we replace the webbot with the correct header tag.
#!/usr/bin/perl -w
use strict;
$^I=".p1";
while (<>) {
#replace explicit title / subtitle (webbot 'Purpletext') with <h1> and <h2>
s%<!--webbot.+?PREVIEW="Titel.+?<font.+?>(.+?)</font>%<h1>$1</h1>%gi;
s%<!--webbot.+?PREVIEW="subtitle.+?<font.+?>(.+?)</font>%<h2>$1</h2>%gi;
#correct <fontsize... => <font size
s%<fontsize%<font size%gi;
print;
}
5. HTML Tidy 1
HTML Tidy is a wonderful application that "tidies" dirty HTML pages. The problem with our pages is that they contain layout tables, which cannot be removed by Tidy. That was the main reason for turning to HaXml.
The markup is still dirty at this point, and the HaXml .exe file demands well-formed XML. In order to achieve this, we let Tidy clean the pages. The HaXml parser is already happy if the markup is well-formed. Check out the relevant Wikipedia page for more information on the difference between "valid" and "well-formed".
Tidy reads its options from a configuration file. In ours we say that lines should wrap at 240 characters; the output should be XHTML and XML and follow the strict document type; quotes should be escaped; <b> and <i> should be replaced with <strong> and <em>; font tags have to remain and Tidy is not allowed to replace presentational tags and attributes with CSS, since we still need them for HaXml; the result should be written back to the same file; the location of the error file is specified; and entities are not written in numeric form.
wrap: 240
output-xhtml: yes
output-xml: yes
doctype: strict
quote-marks: yes
logical-emphasis: yes
drop-font-tags: no
clean: no
write-back: yes
error-file: error.txt
numeric-entities: no
6. HaXml
How it works
In the batch file, the HTML files are then converted to XML because HaXml requires it. The haskell file is in fact a collection of functions and filters applied to the document tree. I removed most of the comments to improve readability:
module Main where
import Text.XML.HaXml
import IO
main = do
    processXmlWith (cleanf `o` deep (tag "html"))
cleanf =
    html
        [ hhead [
            headRun `o` (deep (tag "head")),
            hchar []
            ]
        , hbody
            [ hwrap [secondRun `o` firstRun `o` (deep (tag "body"))]
            ]
        ]
headRun =
    foldXml (txt ?> keep :>
             tag "title" ?> keep :>
             tag "link" ?> keep :>
             tag "meta" ?> keep :>
             attr "src" `o` tag "script" ?> mkFullScript :>
             tag "script" ?> mkScript :>
             tag "style" ?> none :>
             children)
firstRun =
    foldXml (txt ?> keep :>
             fontSize5 ?> replaceTag "h1" :>
             tag "i" /> fontSize4 ?> (mkSubtitle `o` (children `o` tag "i")) :>
             fontSize4 /> tag "i" ?> (mkSubtitle `o` (children `o` tag "font")) :>
             tag "i" /> fontSize3 ?> (mkSubtitle `o` (children `o` tag "i")) :>
             fontSize3 /> tag "i" ?> (mkSubtitle `o` (children `o` tag "font")) :>
             fontSize2 /> tag "b" ?> (replaceTag "h3" `o` (children `o` tag "font")) :>
             tag "b" /> fontSize2 ?> (replaceTag "h3" `o` (children `o` tag "b")) :>
             (attrval("href", AttValue[Left "#top"])) ?> mkTopTab :>
             (attrval("name", AttValue[Left "TopOfPage"])) ?> none :>
             tag "table" /> tag "tr" />
                 (tag "td" `o` (attrval("width", AttValue[Left "45%"]))) /> fontSize1 ?> mkCopyright :>
             (attr "start" `o` tag "ol") ?> replaceTag "li" :>
             finders ?> keep :>
             (attrval("id", AttValue[Left "content"])) `o` tag "div" ?> keep :>
             tag "div" ?> keep :>
             tag "table" ?> keep :>
             tag "tr" ?> keep :>
             tag "td" ?> keep :>
             --All tags to be replaced because of obsolete attributes or deprecated status
             tag "b" ?> keep :>
             tag "blockquote" ?> mkAddress :>
             tag "caption" ?> replaceTag "caption" :>
             tag "cite" ?> replaceTag "cite" :>
             tag "code" ?> replaceTag "code" :>
             tag "dd" ?> replaceTag "dd" :>
             tag "del" ?> replaceTag "del" :>
             tag "dfn" ?> replaceTag "dfn" :>
             tag "dl" ?> replaceTag "dl" :>
             tag "dt" ?> replaceTag "dt" :>
             tag "fieldset" ?> replaceTag "fieldset" :>
             tag "font" ?> keep :>
             tag "h2" ?> replaceTag "h2" :>
             tag "h3" ?> replaceTag "h3" :>
             tag "h4" ?> replaceTag "h4" :>
             tag "h5" ?> replaceTag "h5" :>
             tag "h6" ?> replaceTag "h6" :>
             tag "ins" ?> replaceTag "ins" :>
             tag "i" ?> keep :>
             tag "kbd" ?> replaceTag "kbd" :>
             tag "legend" ?> replaceTag "legend" :>
             tag "li" ?> replaceTag "li" :>
             tag "ol" ?> replaceTag "ol" :>
             tag "p" ?> replaceTag "p" :>
             tag "pre" ?> replaceTag "pre" :>
             tag "q" ?> replaceTag "q" :>
             tag "samp" ?> replaceTag "samp" :>
             tag "strong" ?> replaceTag "strong" :>
             tag "sub" ?> replaceTag "sub" :>
             tag "sup" ?> replaceTag "sup" :>
             tag "tt" ?> replaceTag "tt" :>
             tag "ul" ?> replaceTag "ul" :>
             --'Advanced' filters for tags.
             attr "accesskey" `o` tag "a" ?> mkFullLink :>
             attr "href" `o` tag "a" ?> mkLink :>
             attr "name" `o` tag "a" ?> mkAnchor :>
             attrval("alt",AttValue[Left "spacer.gif (44 bytes)"]) ?> none :>
             attrval("width",AttValue[Left "1"]) ?> none :>
             attrval("height",AttValue[Left "1"]) ?> none :>
             attr "alt" `o` tag "img" ?> mkFullImage :>
             tag "img" ?> mkImage :>
             attr "src" `o` tag "script" ?> mkFullScript :>
             tag "script" ?> mkScript :>
             children)
secondRun =
    foldXml (txt ?> keep :>
             fontSize4 ?> replaceTag "h1" :>
             fontSize3 ?> replaceTag "h2" :>
             --fontSize2 ?> none :>
             finders ?> keep :>
             finders2 ?> keep :>
             (attrval("class", AttValue[Left "goToTop"])) `o` tag "div" ?> keep :>
             (attrval("class", AttValue[Left "copyright"])) `o` tag "div" ?> keep :>
             (attrval("id", AttValue[Left "content"])) `o` tag "div" ?> keep :>
             --All tags to be replaced because of obsolete attributes or deprecated status
             tag "h1" /> tag "em" ?> none :>
             tag "b" ?> tag "strong" :>
             tag "p" /> tag "h1" ?> none :>
             tag "p" /> tag "h2" ?> none :>
             tag "p" /> tag "h3" ?> none :>
             tag "p" /> tag "h4" ?> none :>
             children)
mkFullScript =
    mkElemAttr "script" [ ("type", ("text/javascript"!)), ("src", ("src"?)) ]
        [ children ]
mkScript =
    mkElemAttr "script" [ ("type", ("text/javascript"!)) ]
        [ children ]
mkFullLink =
    mkElemAttr "a" [ ("href", ("href"?)), ("accesskey", ("accesskey"?)) ]
        [ children ]
mkLink =
    mkElemAttr "a" [ ("href", ("href"?)) ]
        [ children ]
mkAnchor =
    mkElemAttr "span" [ ("id", ("name"?)) ]
        [ children ]
mkFullImage =
    mkElemAttr "img" [ ("src", ("src"?)), ("alt", ("alt"?)) ]
        [ children ]
mkImage =
    mkElemAttr "img" [ ("src", ("src"?)), ("alt", ("Temp"!)) ]
        [ children ]
mkSubtitle =
    mkElemAttr "h1" [ ("class", ("subtitle"!)) ]
        [ children ]
mkTopTab =
    mkElemAttr "a" [ ("href", ("#TopOfPage"!)) ]
        [ children ]
mkCopyright =
    mkElemAttr "div" [ ("class", ("copyright"!)) ]
        [ children ]
mkLanguageTab =
    mkElemAttr "img" [ ("src", ("src"?)), ("alt", ("alt"?)) ]
        [ children ]
mkAddress =
    mkElemAttr "div" [ ("class", ("address"!)) ]
        [ children ]
fontSize5 = (attrval("size",AttValue[Left "5"]) `o` tag "font")
fontSize4 = (attrval("size",AttValue[Left "4"]) `o` tag "font")
fontSize3 = (attrval("size",AttValue[Left "3"]) `o` tag "font")
fontSize2 = (attrval("size",AttValue[Left "2"]) `o` tag "font")
fontSize1 = (attrval("size",AttValue[Left "1"]) `o` tag "font")
--lists of tags to be kept
finders = (tag "abbr" `union` tag "acronym" `union` tag "address" `union` tag "base"
    `union` tag "br" `union` tag "button" `union` tag "col" `union` tag "colgroup"
    `union` tag "em" `union` tag "form" `union` tag "h1" `union` tag "input"
    `union` tag "label" `union` tag "noscript" `union` tag "object" `union` tag "optgroup"
    `union` tag "option" `union` tag "param" `union` tag "select" `union` tag "strong"
    `union` tag "textarea" `union` tag "var" `union` attrval("id", AttValue[Left "header"])
    `union` attrval("class", AttValue[Left "accessibility"])
    `union` attrval("id", AttValue[Left "belgium"])
    `union` attrval("class", AttValue[Left "fedmenu"])
    `union` attrval("class", AttValue[Left "accesskey"])
    `union` attrval("class", AttValue[Left "fedheadsearch"])
    `union` attrval("class", AttValue[Left "fedmenu2"])
    `union` attrval("id", AttValue[Left "logo"])
    `union` attrval("id", AttValue[Left "extra"])
    `union` attrval("id", AttValue[Left "menu"])
    `union` attrval("id", AttValue[Left "content"]) )
finders2 = ( tag "a" `union` tag "blockquote" `union` tag "caption" `union` tag "cite"
    `union` tag "code" `union` tag "dd" `union` tag "del" `union` tag "dfn"
    `union` tag "dl" `union` tag "dt" `union` tag "fieldset" `union` tag "h2"
    `union` tag "h3" `union` tag "h4" `union` tag "h5" `union` tag "h6" `union` tag "img"
    `union` tag "ins" `union` tag "kbd" `union` tag "legend" `union` tag "li"
    `union` tag "ol" `union` tag "p" `union` tag "pre"
    `union` tag "q" `union` tag "samp" `union` tag "script" `union` tag "span"
    `union` tag "strong" `union` tag "sub" `union` tag "sup" `union` tag "tt"
    `union` tag "ul")
First of all, we import the necessary modules from the HaXml library:
module Main where
import Text.XML.HaXml
import IO
The main function is called 'main'. It basically processes the XML, starting from the <html> tag, and then applies the function cleanf to it:
main = do processXmlWith (cleanf `o` deep (tag "html"))
What does cleanf do?
cleanf =
    html
        [ hhead [
            headRun `o` (deep (tag "head")),
            hchar []
            ]
        , hbody
            [ hwrap [secondRun `o` firstRun `o` (deep (tag "body"))]
            ]
        ]
cleanf is another function, which builds an <html> element and a <head> element. The contents of the <head> element are constructed by the headRun function applied to the <head> tag of the original XML file. We will deal with the headRun function shortly. hchar [] adds the character encoding.
Then we build the <body> element, which consists of the contents of the <body> element in the original XML file, fed to the function firstRun, which is then fed to the function secondRun. The result is placed inside a wrapper (<div id="wrapper"></div>).
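The order of the two passes can be illustrated with a small, self-contained model. This is only a sketch: the Content type, the two tiny passes, bodyChildren and hwrap below are simplified stand-ins written for this illustration, not HaXml's real types or the helpers from my file. Only the composition operator `o` is defined the way HaXml defines it (the filter on the right runs first).

```haskell
module Main where

-- Simplified stand-in for HaXml's document tree (not the library's real types).
data Content
  = Elem String [(String, String)] [Content] -- tag name, attributes, children
  | Text String
  deriving (Eq, Show)

-- As in HaXml, a filter maps one node to a list of result nodes.
type CFilter = Content -> [Content]

-- Composition: run g first, then f on each of g's results.
o :: CFilter -> CFilter -> CFilter
f `o` g = concatMap f . g
infixr 5 `o`

-- Hypothetical first pass: turn every <font> into an <h1>.
firstRun :: CFilter
firstRun (Elem "font" _ cs) = [Elem "h1" [] cs]
firstRun c = [c]

-- Hypothetical second pass: drop headings that ended up empty.
secondRun :: CFilter
secondRun (Elem "h1" _ []) = []
secondRun c = [c]

-- Hypothetical wrapper: collect all results inside <div id="wrapper">.
hwrap :: [CFilter] -> CFilter
hwrap fs c = [Elem "div" [("id", "wrapper")] (concatMap ($ c) fs)]

-- Stand-in for selecting the contents of <body>.
bodyChildren :: CFilter
bodyChildren (Elem "body" _ cs) = cs
bodyChildren _ = []

page :: Content
page = Elem "body" [] [Elem "font" [("size", "5")] [Text "Welcome"], Elem "font" [] []]

main :: IO ()
main = print (hwrap [secondRun `o` firstRun `o` bodyChildren] page)
-- prints [Elem "div" [("id","wrapper")] [Elem "h1" [] [Text "Welcome"]]]
```

The second <font> only becomes an empty <h1> after the first pass, so the second pass can only remove it because it sees the output of the first.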
headRun =
    foldXml (txt ?> keep :>
             tag "title" ?> keep :>
             tag "link" ?> keep :>
             tag "meta" ?> keep :>
             attr "src" `o` tag "script" ?> mkFullScript :>
             tag "script" ?> mkScript :>
             tag "style" ?> none :>
             children)
The function headRun contains filters that are applied recursively; foldXml takes care of this. It basically says that all text content (i.e. txt) has to remain. This is also true for the <title>, <link> and <meta> elements, including their attributes. The <style> element should be removed from the XML tree: all the CSS will be placed in external stylesheets. Any element that matches none of the tests is replaced by its children (the final children). If a <script> element is found with the attribute src inside it, it has to be turned into a <script> tag that is standards-compliant:
mkFullScript =
    mkElemAttr "script" [ ("type", ("text/javascript"!)), ("src", ("src"?)) ]
        [ children ]
mkScript =
    mkElemAttr "script" [ ("type", ("text/javascript"!)) ]
        [ children ]
A <script> element is created with the attributes type="text/javascript" and src. The latter should contain the value of the src attribute in the original XML file, hence the question mark. The exclamation mark means that the text between quotes has to be inserted.
If no src attribute is found (the script is contained in the XML file and not in an external source file), a simple <script> element should be constructed without the src attribute. This is done in the mkScript function.
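How the exclamation mark and the question mark cooperate with mkElemAttr may be easier to see in a stripped-down model. The sketch below is an assumption-laden imitation, not HaXml itself: Content is a simplified tree type, attribute values are plain strings, and my version of (?) silently yields nothing when the attribute is missing. The operator names and the shape of mkElemAttr follow the usage in the listing above.

```haskell
module Main where

-- Simplified stand-in for HaXml's document tree (not the library's real types).
data Content
  = Elem String [(String, String)] [Content] -- tag name, attributes, children
  | Text String
  deriving (Eq, Show)

type CFilter = Content -> [Content]

children :: CFilter
children (Elem _ _ cs) = cs
children _ = []

-- ("text"!) ignores the current node and produces the literal text.
(!) :: String -> CFilter
(!) s _ = [Text s]

-- ("name"?) produces the value of attribute "name" on the current node.
-- Simplification: a missing attribute gives no result here.
(?) :: String -> CFilter
(?) a (Elem _ as _) = [Text v | Just v <- [lookup a as]]
(?) _ _ = []

-- Build a new element; every attribute value is computed by a filter
-- that is run against the node being replaced.
mkElemAttr :: String -> [(String, CFilter)] -> [CFilter] -> CFilter
mkElemAttr name attrs cfs c =
  [Elem name [(k, concat [s | Text s <- f c]) | (k, f) <- attrs] (concatMap ($ c) cfs)]

mkFullScript :: CFilter
mkFullScript =
  mkElemAttr "script" [("type", ("text/javascript"!)), ("src", ("src"?))] [children]

oldScript :: Content
oldScript = Elem "script" [("language", "JavaScript"), ("src", "menu.js")] []

main :: IO ()
main = print (mkFullScript oldScript)
-- prints [Elem "script" [("type","text/javascript"),("src","menu.js")] []]
```

The language attribute of the old element is not mentioned in the attribute list, so it simply does not appear in the new element.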
firstRun =
    foldXml (txt ?> keep :>
             fontSize5 ?> replaceTag "h1" :>
             tag "i" /> fontSize4 ?> (mkSubtitle `o` (children `o` tag "i")) :>
             fontSize4 /> tag "i" ?> (mkSubtitle `o` (children `o` tag "font")) :>
             tag "i" /> fontSize3 ?> (mkSubtitle `o` (children `o` tag "i")) :>
             fontSize3 /> tag "i" ?> (mkSubtitle `o` (children `o` tag "font")) :>
             fontSize2 /> tag "b" ?> (replaceTag "h3" `o` (children `o` tag "font")) :>
             tag "b" /> fontSize2 ?> (replaceTag "h3" `o` (children `o` tag "b")) :>
             (attrval("href", AttValue[Left "#top"])) ?> mkTopTab :>
             (attrval("name", AttValue[Left "TopOfPage"])) ?> none :>
             tag "table" /> tag "tr" />
                 (tag "td" `o` (attrval("width", AttValue[Left "45%"]))) /> fontSize1 ?> mkCopyright :>
             (attr "start" `o` tag "ol") ?> replaceTag "li" :>
             finders ?> keep :>
The function firstRun is applied to the <body> tag of the original XML file. Again, we want to keep all the text. What does fontSize5 do?
fontSize5 = (attrval("size",AttValue[Left "5"]) `o` tag "font")
It is basically a filter for <font size="5">blabla</font>. If it matches, the element should become a first level header (<h1>). The same is true for the other fontSize filters, but I want to do other things with them. For example, turn them into a subtitle if they are found in combination with an <i> element, where the contents of the <i> element ("children of tag <i>") should be the text of the subtitle. The Haskell file mainly consists of a combination of filters and actions that have to be applied to them. The filters should be read from right to left. finders is a list of tags to be kept and can be found at the bottom of the file.
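The right-to-left reading and the "combination" operator /> can be tried out in isolation. The following is a minimal sketch under stated assumptions: Content is a simplified tree type, attrval takes a plain string pair instead of HaXml's AttValue, and fontSize is a hypothetical helper that generalises fontSize5. The definitions of `o` and /> mirror the ones in HaXml's combinator library.

```haskell
module Main where

-- Simplified stand-in for HaXml's document tree (not the library's real types).
data Content
  = Elem String [(String, String)] [Content] -- tag name, attributes, children
  | Text String
  deriving (Eq, Show)

type CFilter = Content -> [Content]

children :: CFilter
children (Elem _ _ cs) = cs
children _ = []

-- Keep the node only if it is an element with the given name.
tag :: String -> CFilter
tag n c@(Elem m _ _) | n == m = [c]
tag _ _ = []

-- Keep the node only if it carries exactly this attribute/value pair.
attrval :: (String, String) -> CFilter
attrval av c@(Elem _ as _) | av `elem` as = [c]
attrval _ _ = []

-- Composition, read from right to left: g runs first, then f.
o :: CFilter -> CFilter -> CFilter
f `o` g = concatMap f . g
infixr 5 `o`

-- f /> g: the children of f's results that also pass g.
(/>) :: CFilter -> CFilter -> CFilter
f /> g = g `o` children `o` f
infixl 5 />

-- Hypothetical generalisation of fontSize5: first test for <font>, then for the size.
fontSize :: String -> CFilter
fontSize n = attrval ("size", n) `o` tag "font"

heading, sub :: Content
heading = Elem "font" [("size", "5")] [Text "Welcome"]
sub = Elem "i" [] [Elem "font" [("size", "4")] [Text "News"]]

main :: IO ()
main = do
  print (fontSize "5" heading)          -- match: the <font> element itself
  print (fontSize "5" sub)              -- no match: []
  print ((tag "i" /> fontSize "4") sub) -- the <font size="4"> inside the <i>
```

A filter that returns an empty list counts as "no match", which is what the conditional chains in firstRun test for.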
tag "p" ?> replaceTag "p" :>
tag "pre" ?> replaceTag "pre" :>
tag "q" ?> replaceTag "q" :>
tag "samp" ?> replaceTag "samp" :>
tag "strong" ?> replaceTag "strong"
Why do the <p> elements have to be replaced? FrontPage adds a lot of extra information to tags, such as valign, size, color, etc. These invalid attributes should disappear, so HaXml replaces the element without taking the attributes along. <p align="left" color="pink" size="1">blabla</p> will simply become <p>blabla</p>. Other elements have to be kept:
tag "font" ?> keep :>
We need the font element in the secondRun function, so we keep it here. Attributes are filtered in the "advanced" filters:
attr "accesskey" `o` tag "a" ?> mkFullLink :>
attr "href" `o` tag "a" ?> mkLink :>
attr "name" `o` tag "a" ?> mkAnchor :>
attrval("alt",AttValue[Left "spacer.gif (44 bytes)"]) ?> none :>
attrval("width",AttValue[Left "1"]) ?> none :>
attrval("height",AttValue[Left "1"]) ?> none :>
attr "alt" `o` tag "img" ?> mkFullImage :>
tag "img" ?> mkImage :>
attr "src" `o` tag "script" ?> mkFullScript :>
tag "script" ?> mkScript :>
Most of these filters are quite straightforward. For example: any element containing an alt attribute with the value "spacer.gif (44 bytes)" should disappear. If the alt attribute is found in other cases, an <img> tag has to be constructed with the alt attribute, but without height or width (see mkFullImage). If the attribute name is found inside an <a> tag, it should become an anchor.
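Both kinds of clean-up described so far, dropping attributes with replaceTag and deleting spacer images with attrval, fit in a few lines of a simplified model. As before this is a sketch, not HaXml: Content is a stand-in tree type, attribute values are plain strings, and cleanNode is a hypothetical three-branch chain made up for the example. replaceTag and the conditional operators are written to behave like their HaXml namesakes.

```haskell
module Main where

-- Simplified stand-in for HaXml's document tree (not the library's real types).
data Content
  = Elem String [(String, String)] [Content] -- tag name, attributes, children
  | Text String
  deriving (Eq, Show)

type CFilter = Content -> [Content]

keep, none :: CFilter
keep c = [c]
none _ = []

tag :: String -> CFilter
tag n c@(Elem m _ _) | n == m = [c]
tag _ _ = []

attrval :: (String, String) -> CFilter
attrval av c@(Elem _ as _) | av `elem` as = [c]
attrval _ _ = []

-- Rename the element and drop all of its attributes; the children survive.
replaceTag :: String -> CFilter
replaceTag n (Elem _ _ cs) = [Elem n [] cs]
replaceTag _ _ = []

-- Conditional: p ?> f :> g applies f when p matches, otherwise g.
data ThenElse a = a :> a
infixr 3 ?>, :>

(?>) :: CFilter -> ThenElse CFilter -> CFilter
p ?> (f :> g) = \c -> if null (p c) then g c else f c

-- Hypothetical mini-chain: delete spacer images, strip attributes from <p>, keep the rest.
cleanNode :: CFilter
cleanNode =
  attrval ("alt", "spacer.gif (44 bytes)") ?> none :>
  tag "p" ?> replaceTag "p" :>
  keep

paragraph, spacer :: Content
paragraph = Elem "p" [("align", "left"), ("color", "pink")] [Text "blabla"]
spacer = Elem "img" [("src", "spacer.gif"), ("alt", "spacer.gif (44 bytes)")] []

main :: IO ()
main = do
  print (cleanNode paragraph) -- prints [Elem "p" [] [Text "blabla"]]
  print (cleanNode spacer)    -- prints []
```

The order of the branches matters: the first test that matches decides what happens to the node, which is why the specific attrval tests come before the generic tag tests.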
In fact, every filter can be changed to your liking. I chose the filters and actions with a specific kind of HTML in mind, in order to apply CSS to it. Maybe you need a different document tree. That is entirely up to you.
secondRun =
    foldXml (txt ?> keep :>
             fontSize4 ?> replaceTag "h1" :>
             fontSize3 ?> replaceTag "h2" :>
             finders ?> keep :>
             finders2 ?> keep :>
             (attrval("class", AttValue[Left "goToTop"])) `o` tag "div" ?> keep :>
             (attrval("class", AttValue[Left "copyright"])) `o` tag "div" ?> keep :>
             (attrval("id", AttValue[Left "content"])) `o` tag "div" ?> keep :>
             --All tags to be replaced because of obsolete attributes or deprecated status
             tag "h1" /> tag "em" ?> none :>
             tag "b" ?> tag "strong" :>
             tag "p" /> tag "h1" ?> none :>
             tag "p" /> tag "h2" ?> none :>
             tag "p" /> tag "h3" ?> none :>
             tag "p" /> tag "h4" ?> none :>
             children)
The result of firstRun is used as input for secondRun. Why not everything in the same run? I had some problems with the output and this was solved when two "runs" were used instead of one. I think this was due to the laziness of Haskell or the fact that the filters were recursive, but I don't really remember. The secondRun function is, again, pretty straightforward. Most functions and filters have been explained earlier in this text.
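For readers who want to see what the long ?> ... :> chains inside foldXml amount to, here is a minimal model of the mechanism. It is a sketch under assumptions: Content is a simplified tree type rather than HaXml's, and the chain is a shortened, hypothetical version of headRun. foldXml itself is written to do what the library function does: rewrite the children first, then apply the filter to the node.

```haskell
module Main where

-- Simplified stand-in for HaXml's document tree (not the library's real types).
data Content
  = Elem String [(String, String)] [Content] -- tag name, attributes, children
  | Text String
  deriving (Eq, Show)

type CFilter = Content -> [Content]

keep, none, children, txt :: CFilter
keep c = [c]
none _ = []
children (Elem _ _ cs) = cs
children _ = []
txt c@(Text _) = [c]
txt _ = []

tag :: String -> CFilter
tag n c@(Elem m _ _) | n == m = [c]
tag _ _ = []

-- Conditional: p ?> f :> g applies f when p matches, otherwise g.
data ThenElse a = a :> a
infixr 3 ?>, :>

(?>) :: CFilter -> ThenElse CFilter -> CFilter
p ?> (f :> g) = \c -> if null (p c) then g c else f c

-- Bottom-up traversal: children are rewritten first, then the node itself.
foldXml :: CFilter -> CFilter
foldXml f (Elem n as cs) = f (Elem n as (concatMap (foldXml f) cs))
foldXml f c = f c

-- Shortened, hypothetical version of headRun.
headRun :: CFilter
headRun =
  foldXml (txt ?> keep :>
           tag "title" ?> keep :>
           tag "meta" ?> keep :>
           tag "style" ?> none :>
           children)

sample :: Content
sample =
  Elem "head" []
    [ Elem "title" [] [Text "Home"]
    , Elem "style" [] [Text "p { color: red }"]
    , Elem "center" [] [Elem "meta" [("name", "author")] []]
    ]

main :: IO ()
main = print (headRun sample)
-- prints [Elem "title" [] [Text "Home"],Elem "meta" [("name","author")] []]
```

The unknown <center> element and the <head> element itself fall through to the final children branch, so they are unwrapped while their kept contents stay in place.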
Compilation
I used GHC version 6.4 for the compilation. The file Main.hs results in main.exe, which can then be called. Other formats are used on Linux, of course.
In order to compile, I put the file Main.hs in C:\haskell\HaXml-1.13.2\src. Then I compiled it using the following command:
C:\haskell\HaXml-1.13.2\src>c:\ghc\ghc-6.4\bin\ghc.exe --make -package lang Main.hs
You will probably get some errors when doing that. If so, download my version of the Text folder (zip) and replace the folder C:\haskell\HaXml-1.13.2\src\Text with mine. This could cause problems with other HaXml applications, but I didn't find any solution other than deleting the lines GHC is complaining about.
Possible problems
I had my share of problems using HaXml. The first issue was that I had to feed it XML while all the files I had were in dirty HTML. This was solved using perl and HTML Tidy.
A second obstacle concerned the use of entities. If they were found inside an attribute value, HaXml choked on the ampersand. I contacted Malcolm Wallace on this and eventually I located the problem in the Haskell library, namely in the Lex module. This issue, however, has been dealt with in the new version of HaXml (version 1.13.2).
Small errors still arise, but I have found that they are mostly due to a mistake in the XML file that has not been cleaned up properly by the perl files or HTML Tidy (e.g. <fontsize="5"> instead of <font size="5">). HaXml will always refer to the line number of the XML file where the error was found.
7. whiteSpaces.pl
This perl file is called after HaXml. The latter emits markup that includes lots of whitespace. WhiteSpaces.pl removes it and fixes some other errors, such as a <p> inside an <a> element. It also puts a div around the top image:
#!/usr/bin/perl -w
use strict;
$^I=".p0";
while (<>) {
    #replace \n around entities in Haskell output. We often have to keep the spaces,
    #hence $i and ($i+1)
    for (my $i = 1; $i <= 50; $i++) {
        #{$i+1} is not evaluated inside a regex, so compute the value first
        my $j = $i + 1;
        s%\n {$i}(&.+?;</\w+)\n %$1%gi;
        s%\n {$i}(>&.+?; *?)\n {$i}%$1%gi;
        s%\n {$i}(&.+?; *?)\n {$i}(&.+?; *?)\n {$i}%$1$2%gi;
        s%\n {$i}(&.+?; *?)\n {$i}%$1%gi;
        s%\n {$i}(&.+?; *?)\n {$j}%$1 %gi;
    }
    #remove <p> inside <a></a>
    s%(<a.+?>)<p.+?>(.+?)</p>%$1$2%gis;
    #put top image inside a div
    s%(<a href="#TopOfPage"\s+><img src=".*?top_.+?\.gif"\s+alt="top"/></a\s+>)%<div class="goToTop">$1</div>%gi;
    print;
}
8. HTML Tidy 2
Tidy is then called again to pretty-print the file. Check the first configuration file for more explanation on the options, since they are mostly the same:
wrap: 240
hide-endtags: no
output-xhtml: yes
output-xml: yes
doctype: loose
quote-marks: yes
logical-emphasis: yes
drop-font-tags: no
clean: no
write-back: yes
error-file: error.txt
numeric-entities: no
9. lists.pl
Lists.pl is the largest perl file. In the beginning, there were some problems with list items not being nested properly or containing paragraphs. These problems were mostly solved with perl string substitutions. Later on, lists.pl was also used for other miscellaneous replacements, such as adding alt texts, removing line breaks inside titles, deleting empty paragraphs, editing the copyright div or dealing with problematic ampersands. It would take us way too far to deal with all the substitutions in the file, since most depend on the site you are working on and the markup you want as an end result.
#!/usr/bin/perl -w
use strict;
my $string;
$^I=".p2";
while (<>) {
#replace <li><p>something</p></li> with <li>something</li>
s%<li>(.*?)<p>(.+?)</p>(.*?)</li>%<li>$1$2$3</li>%gis;
s%<li>\s*?<p>(.+?)<br />\s*?(.+?)</p>\s*?</li>%<li>$1<br />\n$2</li>%gi;
#replace <script language="javascript"> with <script type="text/javascript">
s%<script language="javascript".*?>%<script type="text/javascript">%gi;
#remove loadinframe
s%<script.+?loadinframe.+?</script>%%gis;
#manage spaces around accesskeys
s% (<span.+?"accesskey".*?>)%$1%gi;
#replace <p><br /> with <p>
s%<p><br />\n%<p>%gi;
#remove <p> in <li>
s%<li>(.*?)<p>(.+?)</p>(.*?)</li>%<li>$1$2$3</li>%gis;
#remove <p> </p>
s%<p> </p>%%gis;
#remove <strong> inside <h1>
s%<h1>(.*?)<strong>(.+?)</strong>(.*?)</h1>%<h1>$1$2$3</h1>%gi;
#remove second instance of <span id="TopOfPage">
s%<div id="wrapper">%<div id="wrapper"><span id="TopOfPage" />%gi;
#remove obsolete bookmarks
s%<span id="P[0-9 \-].+?"></span>%%gi;
s%<p></p>%%gi;
s%<h1></h1>%%gi;
s%<span id="\d+">(.*?)</span>%$1%gi;
#remove breaks in <h1>
s%(<h1>.*?)<br />(.*?</h1>)%$1$2%gis;
#prefix <![CDATA[ with //
s%(<!\[CDATA\[)%//$1%gis;
#correct alt texts of update and new gifs
s%(src=".*?update\.gif".*?alt=(\n)?.+?>)%src="/images/buttons/update\.gif" alt="Updated" />%gi;
s%(src=".*?new\.gif".*?alt=(\n)?.+?>)%src="/images/buttons/new\.gif" alt="New" />%gi;
#remove "invisible index"
s%<a href=".+?Default.htm"><img src=".+?images/logo_anim_klein.gif" alt=".+?" /></a>%%gi;
#add alt text to toc image and change src file
s%(<a href=".*?home.+?".*?>.*?)<img src=".*?toc_.+?\.gif".+?</a>%$1<img src="/images/navigation/toc\.gif" alt="Overview" /></a>%gi;
s%<img src="(\.\./)*images/navigation/top_.+?\.gif" alt=(\n)?.+?/>%<img src="/images/navigation/top\.gif" alt="Top" />%gi;
s%<img src="(\.\./)*images/navigation/toc_.+?\.gif" alt=(\n)?.+?>%<img src="/images/navigation/toc\.gif" alt="Overview" />%gi;
s%<img src="(\.\./)*images/navigation/right_.+?\.gif" alt=(\n)?.+?>%<img src="/images/navigation/right\.gif" alt="Next" />%gi;
s%<img src="(\.\./)*images/navigation/left_.+?\.gif" alt=(\n)?.+?>%<img src="/images/navigation/left\.gif" alt="Previous" />%gi;
#s%<img src="(\n)?.+?index_.+?\.gif" alt=(\n)?.+?>%%gi;
s%<img src="(\n)?.*?spacer.*?\.gif".*?>%%gi;
#remove noscript for menu
s%<noscript>.*?\[.*?(Menu|Next|Back|Top).*?\].*?</noscript>%%gis;
#deal with some problematic ampersands
s%(<body>.+?)([ >(])R\&D([^a-zA-Z0-9])%$1$2<abbr title="Research and Development">R\&D</abbr>$3%g;
s%([ >(])S\&T([^a-zA-Z0-9])%$1<abbr title="Science and Technology">S\&T</abbr>$2%g;
s%([ >(])W\&T([^a-zA-Z0-9])%$1<abbr title="Science and Technology">S\&T</abbr>$2%g;
s%(<p>)?(<a href="#TopOfPage"><img src="/images/navigation/top.gif" alt="Top" /></a>)(</p>)?% <div class="goToTop">$2</div>%gis;
print;
}
This text does not provide a full explanation, as I realise all too well. The goal is not to present a tutorial, but to show that HaXml can be used for cleaning large websites.
