Monday, January 31, 2011

XPath on HTML in Python

This technique lets you treat HTML as XML, even when the HTML is not entirely valid, and run XPath expressions over it.
from lxml import etree

content = '... some html ...'
# use the HTML parser explicitly to provide encoding
parser = etree.HTMLParser(encoding='utf-8')
# load the content using the parser
tree = etree.fromstring(content, parser)
# we now have an XML tree built from the HTML
# now get all links in the doc
links = tree.xpath("//a")  # every <a> element, at any depth
for link in links:
    href = link.get('href') # get tag's attribute
    name = link.text # text right after the opening tag (an attribute, not a method; may be None)
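
To make the snippet above concrete, here is a minimal runnable sketch on a deliberately sloppy piece of HTML. The sample markup and the helper name `extract_links` are made up for illustration and are not from the lxml documentation. It also joins `itertext()` instead of reading `link.text`, because `link.text` only holds the text before the first child element (for the first link below it would be just "the ").

```python
from lxml import etree

# Hypothetical sample input: unclosed <p> and <li> tags, no <html>/<body> wrapper.
SAMPLE_HTML = b"""
<p>See <a href="/docs">the <b>docs</b></a>
<ul>
  <li><a href="https://example.com/api">API</a>
  <li><a>anchor without href</a>
</ul>
"""

def extract_links(html_bytes):
    """Return (href, text) pairs for every <a> element that has an href."""
    # The HTML parser repairs the broken markup while building the tree.
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.fromstring(html_bytes, parser)
    results = []
    # The [@href] predicate skips anchors that carry no href attribute.
    for link in tree.xpath("//a[@href]"):
        # Join all nested text nodes, including text inside child tags like <b>.
        text = "".join(link.itertext()).strip()
        results.append((link.get("href"), text))
    return results

links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

The third anchor is skipped by the XPath predicate, so only the two real links are printed.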

Some links:
API reference
Usage tutorial
