Wrestling HTML
September 8, 2004
Lately I've seen HTML parsing problems everywhere. One project needed a web crawler with specialized features provided through Python code that processed arbitrary HTML. There have also been several threads on mailing lists I frequent (including XML-SIG) featuring discussions of mechanisms for dealing with broken HTML by converting it to decent XHTML. This article focuses on Python APIs for converting good or bad HTML to XML.
Based on glowing testimonials from others with HTML-parsing tasks, I looked first at BeautifulSoup; but based on the stated project goals and an examination of the API, BeautifulSoup is clearly suited to extracting bits of data from HTML rather than converting it into XML. I did, however, work up Listing 1 as a simple test case of bad HTML, based on an example in the BeautifulSoup documentation.
Listing 1: An Example of Bad HTML
<body> Go <a class="that" href="here.html"><i>here</i></a> or <i>go <b><a href="index.html">Home</a> <!--noncetag>spam</noncetag><!--eggs--> </html>
Notice the broken comment in the file. I added it because I've seen HTML parsers tripped up by strange uses of comments. <!--noncetag>spam</noncetag><!--eggs--> is a bad comment because it contains two dashes in its body.
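To see why two dashes in a comment body are fatal, here is a quick check using Python's standard library XML parser (my own illustration, not part of the article's toolchain); expat rejects any comment containing "--":

```python
import xml.etree.ElementTree as ET

good = "<root><!-- a normal comment --></root>"
bad = "<root><!--noncetag>spam</noncetag><!--eggs--></root>"

# The well-formed comment parses without complaint.
ET.fromstring(good)

# The bad comment triggers a well-formedness error, because "--"
# may not appear inside an XML comment body.
try:
    ET.fromstring(bad)
except ET.ParseError as e:
    print("rejected:", e)
```

Any tool that converts HTML to XML therefore has to repair or drop such comments before emitting output.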
uTidyLib
uTidyLib is a Python wrapper for the HTML Tidy Library Project (libtidy), an embeddable variation on Dave Raggett's HTML Tidy command-line program. Libtidy is written in C, and uTidyLib is a minimalist, straightforward wrapping. I downloaded uTidylib-0.2.zip and installed it. It requires libtidy, so I downloaded and installed the source code package tidy_src.tgz dated 11 August 2004. uTidyLib also uses ctypes, "a Python package to create and manipulate C data types in Python, and to call functions in dynamic link libraries/shared dlls," so I downloaded and installed ctypes-0.9.0.tar.gz. In all there were a lot of parts to find and set up, but the instructions were straightforward and I had no installation problems. I used the example from the uTidyLib home page to be sure it all worked in the end.
Listing 2 is the first program I worked up for taking an input file name of bad HTML and converting the contents to XHTML.
Listing 2: uTidyLib Program to Convert HTML to XHTML
import tidy
import sys

def tidy2xhtml(instream, outstream):
    options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1
                   )
    tidied = tidy.parseString(instream.read(), **options)
    tidied.write(outstream)
    return

doc = open(sys.argv[1])
tidy2xhtml(doc, sys.stdout)
I had to read the entire input file in order to pass the contents as a string, because uTidyLib provides no interface for getting HTML source from a file-like object. tidy.parse is the other available function, but it takes a file name. This could be inconvenient in the case of large source files. The options dictionary represents options for the underlying Tidy, which are listed in the HTML Tidy Quick Reference. Using the dictionary constructor idiom, options have to be provided in a form acceptable as Python identifiers, in particular by converting hyphens to underscores, so the Tidy option fix-bad-comments would be specified as fix_bad_comments.
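Since every Tidy option must be mangled this way, the conversion can be done mechanically. make_tidy_options below is my own small convenience function, not part of uTidyLib:

```python
def make_tidy_options(raw_options):
    """Turn Tidy's hyphenated option names (as documented in the
    HTML Tidy Quick Reference) into valid Python identifiers."""
    return dict((name.replace('-', '_'), value)
                for name, value in raw_options.items())

options = make_tidy_options({
    'output-xhtml': 1,
    'add-xml-decl': 1,
    'fix-bad-comments': 1,
})
# The result can be passed straight along as
# tidy.parseString(source, **options)
```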
The result of running Listing 2 against Listing 1 is as follows:
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1st August 2004), see www.w3.org" />
<title></title>
</head>
<body>
Go <a class="that" href="here.html"><i>here</i></a> or <i>go
<b><a href="index.html">Home</a>
<!--noncetag>spam</noncetag><!==eggs--></b></i>
</body>
</html>
Notice how the bad comment is corrected by replacing the "--" with "==". uTidyLib also fills out all the half-specified elements, whether valid SGML tag minimization (it's perfectly legal HTML to not close p tags, for instance) or not (there is a closing but not an opening html tag).
I tried uTidyLib on a variety of files, usually with very nice XHTML results. I also tried with a variety of encodings, since the web crawler project I mentioned involved crawling international versions of sites. I ran into trouble as soon as I tried pages in Japanese. As an example, I use the Japanese document Hello world HTML, which is actually perfectly valid HTML that just happens to be encoded in the popular Shift-JIS encoding (there is a mix of English and Japanese in the document). Figure 1 is a bit of English/Japanese mix from the Table of Contents.
Figure 1: Sample of English and Japanese Text from Valid HTML Document
This bullet item gets turned into the following XML by uTidyLib:
<li> <a href="hwht01.htm" accesskey="1">Section 1</a> : HTML Šî‘b‚ÌŠî‘b </li>
This would be rendered in the browser as in Figure 2:
Figure 2: Sample of English and Japanese Text from Valid HTML Doc after Mangling by uTidyLib
Clearly not what the original document author intended. It turns out that Tidy cannot really detect the source document's encoding, even when it's properly and clearly stated (the document has LANG="ja" in the html element and <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Shift_JIS">). Tidy just assumes ISO-8859-1. It also turns out that Tidy outputs US-ASCII encoding by default. I suppose the US-ASCII default for generated XHTML is to accommodate outdated browsers that can't deal with UTF-8 and UTF-16. The inability to detect encodings, on the other hand, is unfortunate and a severe limitation. I trawled the options and couldn't find anything to turn encoding detection on, but I did find options to tell Tidy what input and output encodings to use.
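One workaround is to sniff the declared charset yourself before handing the source to Tidy. The regular expression below is my own rough sketch; it only handles the common META http-equiv form and ignores every other way an encoding can be declared:

```python
import re

META_CHARSET = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_-]+)',
                          re.IGNORECASE)

def sniff_charset(html_source, default='latin1'):
    """Crudely look for a charset declaration near the top of the
    document, falling back to a default when none is found."""
    match = META_CHARSET.search(html_source[:2048])
    if match:
        return match.group(1)
    return default

sample = ('<html LANG="ja"><head><META HTTP-EQUIV="Content-Type" '
          'CONTENT="text/html; charset=Shift_JIS"></head></html>')
print(sniff_charset(sample))   # Shift_JIS
```

The sniffed name can then be fed to Tidy's input-encoding option.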
I updated the program in Listing 2 to specify an encoding for the source document by hand (e.g. "Shift-JIS") and to always produce UTF-8 output. In doing so I ran into another odd limitation in Tidy. It seems to refuse encoding names unless they are in all lowercase, with all dashes eliminated. For example, it refused "Shift-JIS" or "UTF-8", throwing an exception: "tidy.error.OptionArgError: missing or malformed argument for option: input-encoding". By trial and error I figured out that "shiftjis" and "utf8" were required for things to work and that I could not use any likely spelling of "ISO-8859-1" at all, but had to use the alternate name "latin1" instead. The updated code is in Listing 3.
Listing 3: uTidyLib Program to Convert HTML to XHTML Using Specified Encodings
import tidy
import sys

def tidy2xhtml(instream, outstream, encoding):
    options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding=encoding
                   )
    tidied = tidy.parseString(instream.read(), **options)
    tidied.write(outstream)
    return

doc = open(sys.argv[1])
try:
    encoding = sys.argv[2]
except IndexError:
    encoding = 'latin1'
tidy2xhtml(doc, sys.stdout, encoding)
This allows me to specify the encoding as the second command-line argument if I know it.
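Rather than memorize Tidy's spellings, the name mangling can be captured in another small helper. This is my own guess at the rules, based purely on the trial and error described above (lowercase, drop dashes, plus the one latin1 alias I needed):

```python
def tidy_encoding_name(name):
    """Normalize a conventional encoding name into the form Tidy
    accepted in my tests: lowercased, dashes removed, with
    ISO-8859-1 mapped to its alias latin1."""
    normalized = name.lower().replace('-', '')
    if normalized == 'iso88591':
        return 'latin1'
    return normalized

print(tidy_encoding_name('Shift-JIS'))   # shiftjis
print(tidy_encoding_name('UTF-8'))       # utf8
print(tidy_encoding_name('ISO-8859-1'))  # latin1
```

Other aliases may well be needed for encodings I didn't try; Tidy's own documentation lists the accepted values.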
libxml2's HTML Parser
I'm always surprised to see what useful bits are buried in libxml2 and available through the Python binding (see my article on this topic). One of them is an HTML reader that can handle bad HTML and create a tree object that is not at all XHTML, but is at least a well-formed rendition of the source document, which is usually good enough. The following snippet illustrates this tool.
>>> import libxml2
>>> #Again seems to require the full string
>>> source = open('listing1.html').read()
>>> hdoc = libxml2.htmlParseDoc(source, None)
HTML parser error : Opening and ending tag mismatch: html and b
</html>
       ^
HTML parser error : Opening and ending tag mismatch: html and i
</html>
       ^
Despite these warnings, hdoc is a usable node at this point. It doesn't give you DOM, but rather libxml2's specialized tree API, which, as I mentioned in an earlier article, I find unevenly documented and hard to navigate. The libxml2 page talks about "DOM," but I think they use the term generically, not meaning the W3C specification and certainly not the Python standard-library DOM conventions.
>>> print hdoc
/usr/lib/python2.3/site-packages/libxml2.py:3597: \
    FutureWarning: %u/%o/%x/%X of negative int will \
    return a signed string in Python 2.4 and up
  return "<xmlDoc (%s) object at 0x%x>" % (self.name, id(self))
<xmlDoc (None) object at 0xf7032bcc>
The warning, which I got with my Python 2.3 installation, appears only the first time you convert a node to string (e.g. implicitly, using print) and seems harmless. I assume the libxml2 crew will address any potential problems before Python 2.4 is finalized. You can see the document libxml2 interpreted from the bad HTML by re-serializing it.
>>> print hdoc.serialize()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p> Go <a class="that" href="here.html"><i>here</i></a>
or <i>go <b><a href="index.html">Home</a>
<!--noncetag>spam</noncetag><!--eggs-->
</b></i></p></body></html>

>>> hdoc.freeDoc()
Clearly it didn't complete the document as effectively as uTidyLib did: it didn't fix the broken comment, and the generated document-type declaration is untenable for an XML document, but the result is useful nevertheless. Don't forget the freeDoc() call, since libxml2/Python requires manual memory management.
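Because an exception between the parse and the freeDoc() call would leak the document, I find it worth wrapping the pattern in a try/finally helper. The sketch below is generic and of my own devising; with libxml2, the parse and free callables would be htmlParseDoc and the node's freeDoc method:

```python
def with_parsed_doc(parse, free, source, operation):
    """Parse source, apply operation to the resulting document,
    and guarantee the matching free call runs even if operation
    raises an exception."""
    doc = parse(source)
    try:
        return operation(doc)
    finally:
        free(doc)

# Hypothetical libxml2 usage (names as in the snippets above):
#   xhtml = with_parsed_doc(
#       lambda s: libxml2.htmlParseDoc(s, None),
#       lambda d: d.freeDoc(),
#       source,
#       lambda d: d.serialize())
```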
I also pointed libxml2's HTML parser directly at the Japanese document mentioned earlier:

>>> uri = 'http://www.tg.rim.or.jp/~hexane/ach/hwht/'
>>> hdoc = libxml2.htmlParseFile(uri, None)
The result from re-serialization seemed to maintain the Shift-JIS content, but I got a very strange JavaScript error message when I wrote it to a file and tried to view it in Firefox. Clearly dealing with HTML files in various encodings is a difficult task that complicates any efforts to cleanly process the HTML.
Wrap Up
I've heard some other Python tools discussed for converting HTML to usable XML (or XML tree objects):
- ElementTidy uses Tidy to create XHTML in the form of an ElementTree object (see my article on the topic).
- The twisted.web.microdom module in Twisted has an option, beExtremelyLenient=True, that creates a tree from even broken HTML.
If you just need to extract information from broken HTML, there are some other options.
- The aforementioned BeautifulSoup.
- The HTML Scraper recipe on the Python Cookbook needs a lot of tweaking, based on my experience.
Thanks to the participants in the mailing-list threads discussing Python parsers for broken HTML. If you have other suggestions I haven't covered, please post them as comments to this article.
News and Notes
XML-SIG members and others have, as usual, been busy this last month. Mike Hostetler announced XMLBuilder 1.1. "You create an XMLBuilder object, send it some dictionary data, and it will generate the XML for you." See the announcement.
Mark Pilgrim announced the publication of his book Dive Into Python, available in its entirety online (though you should buy the physical book if you like it). Chapter 9: XML Processing is especially of interest. See the announcement.
I released Scimitar 0.6.0, an update of my ISO Schematron implementation that compiles a Schematron schema into a Python validator script. It adds support for keys, fixes diagnostic messages, and a few other things. See the announcement.
Fredrik Lundh did some time and space benchmarks of Python libraries for parsing and representing XML. It includes minidom, elementtree, PyRXPu (the only XML-compliant variant of PyRXP), and pxdom. He does not specify his methodology, except that he parsed a "3.5 MB source file." More clarity on his test methods, including harness code and measurement methodology, would be nice. He plans to add xml.objectify, libxml/Python, and cDomlette.
Jarno Virtanen has posted some quick code for performing an XSL transformation in Jython. I've added this to my reference page on Python XSLT processing APIs.