Monday, November 09, 2009

Parsing HTML in Python with BeautifulSoup

I got into a spat with Eric Raymond the other day about some code he's written called ForgePlucker. I took a look at the source code and posted saying it looks like a total hack job by a poor programmer.

Raymond replied by posting a blog entry in which he called me a poor fool and snotty kid.

So far so good. However, he hadn't actually fixed the problems I was talking about (and which I still think are the work of a poor programmer). This morning I checked and found that he's removed the two offending lines I'd pointed out and done some code rearrangement. The function that initially caught my eye is one that parses the rows out of an HTML table, which he does with this code:

def walk_table(text):
    "Parse out the rows of an HTML table."
    rows = []
    while True:
        oldtext = text
        # First, strip out all attributes for easier parsing
        text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
        text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
        # Case-smash all the relevant HTML tags, we won't be keeping them.
        text = text.replace("</table>", "</TABLE>")
        text = text.replace("<td>", "<TD>").replace("</td>", "</TD>")
        text = text.replace("<tr>", "<TR>").replace("</tr>", "</TR>")
        text = text.replace("<br>", "<BR>")
        # Yes, Berlios generated \r<BR> sequences with no \n
        text = text.replace("\r<BR>", "\r\n")
        # And Berlios generated doubled </TD>s
        # (This sort of thing is why a structural parse will fail)
        text = text.replace("</TD></TD>", "</TD>")
        # Now that the HTML table structure is canonicalized, parse it.
        if text == oldtext:
            break
    end = text.find("</TABLE>")
    if end > -1:
        text = text[:end]
    while True:
        m = re.search(r"<TR>\w*", text)
        if not m:
            break
        start_row = m.end(0)
        end_row = start_row + text[start_row:].find("</TR>")
        rowtxt = text[start_row:end_row]
        rowtxt = rowtxt.strip()
        if rowtxt:
            rowtxt = rowtxt[4:-5]  # Strip off <TD> and </TD>
            rows.append(re.split(r"</TD>\s*<TD>", rowtxt))
        text = text[end_row+5:]
    return rows

The problem with writing code like that is maintenance. It's got all sorts of little assumptions and special cases. Notice how it can't cope with a mixed-case tag like <Td>? Or how there's a special case for handling a doubled </TD>?
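To see one of those assumptions bite, feed it a table with mixed-case tags (a made-up example, not actual Berlios output). The case-smashing step only knows about <td> and <TD>, so <Td> slips straight through and the row never gets split into cells:

# Hypothetical input; assumes the walk_table above plus "import re".
html = "<table><tr><Td>a</Td><Td>b</Td></tr></table>"
print(walk_table(html))   # [['a</Td><Td>b']] -- the two cells come back glued together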

A much better approach is to use an HTML parser that knows all about the foibles of real HTML in the real world. (Raymond's main argument in his blog posting is that you can't rely on the HTML structure to give you semantic information. I actually agree with that, but I don't agree that throwing the baby out with the bathwater is the right response.) With such a parser you eliminate all the hassle of maintaining regular expressions for every weird HTML situation, of dealing with case, and of dealing with HTML attributes.

Here's the equivalent function written using the BeautifulSoup parser:

from BeautifulSoup import BeautifulSoup

def walk_table2(text):
    "Parse out the rows of an HTML table."
    soup = BeautifulSoup(text)
    return [[col.renderContents() for col in row.findAll('td')]
            for row in soup.find('table').findAll('tr')]
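For example, it copes with the mixed-case table from earlier without any extra code:

html = "<table><tr><Td>a</Td><Td>b</Td></tr></table>"
print(walk_table2(html))   # [['a', 'b']]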

In Raymond's code above he includes a little jab at this style, saying:

# And Berlios generated doubled </TD>s
# (This sort of thing is why a structural parse will fail)
text = text.replace("</TD></TD>", "</TD>")

But that doesn't actually stand up to scrutiny. Try it and see: BeautifulSoup handles the extra </TD> without any special cases.
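Here's a quick check, using a doubled </TD> of the same shape Raymond describes:

html = "<table><tr><TD>a</TD></TD><TD>b</TD></tr></table>"
print(walk_table2(html))   # [['a', 'b']] -- the stray </TD> is simply ignored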

Bottom line: parsing HTML is hard; don't make it harder on yourself by deciding to do it yourself.

Disclaimer: I am not an experienced Python programmer, so there could be a nicer way to write my walk_table2 function above, although I think it's pretty clear what it's doing.


8 Comments:

OpenID brianlane.com said...

Excellent job! I've done things both ways in the past (mostly due to lack of good parsers at the time), and prefer using BeautifulSoup -- I know I can't come up with all the possible exceptions myself.

3:29 PM  
Blogger Jim Robert said...

I agree with your assessment of beautiful soup - less code is usually better code

5:18 PM  
Blogger peterbe said...

I agree with you. Why even be interested in structure when you can get the meaning straight away?

PS. If you use lxml and BeautifulSoup you can use CSS to extract meaning from a broken HTML document. Come on, Eric! Catch up!
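Something like this, assuming lxml's BeautifulSoup bridge (lxml.html.soupparser) and its CSS selector support:

from lxml.html import soupparser

# Deliberately broken HTML: mixed case, unclosed tags.
root = soupparser.fromstring("<table><tr><Td>a<td>b</table>")
print([td.text for td in root.cssselect("tr td")])   # ['a', 'b']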

7:01 PM  
Blogger Shadow14l said...

BeautifulSoup is a generalized library. I have used regexes for specific matches within strings inside HTML tags. On large files, repeated parsing slows down considerably, and a well-chosen regex can fit the specifics of your particular input better.

1:02 AM  
OpenID yorksranter said...

All my projects contain BeautifulSoup at some point; it's fantastically great.

11:49 AM  
Blogger rwenderlich said...

I used Beautiful Soup for the first time last week - loved it, it made what I was trying to do super easy.

2:56 PM  
Blogger Metalshark said...

I'm afraid that for uses such as Google App Engine the overhead of Beautiful Soup is too much.

I find that using re.VERBOSE and grouping what is required with (?P<name>...) helps with maintainability.
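A minimal sketch of that style, with a hypothetical pattern:

import re

ROW = re.compile(r"""
    <tr[^>]*>          # opening row tag, attributes ignored
    (?P<cells>.*?)     # capture everything inside the row
    </tr>              # closing row tag
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)

m = ROW.search("<TR class='x'><TD>a</TD></TR>")
print(m.group("cells"))   # <TD>a</TD>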

4:54 PM  
Blogger hendrik said...

I've done quite a bit of scraping in Python, the bulk of it using PyParsing, sometimes in combination with BeautifulSoup. If you haven't heard of PyParsing, I suggest you give it a try: http://pyparsing.wikispaces.com/

2:40 PM  
