Django talk: Re: How to get the source code of an url?

Wednesday, November 28, 2012

Re: How to get the source code of an url?

On Tue, Nov 27, 2012 at 6:17 PM, donarb <donarb@nwlink.com> wrote:
> You're not parsing XML, it's HTML and it's not well formed, for example your
> title and author tags have closing tags that don't match. Your HTML needs to
> be valid XHTML before trying to use an XML parser on it. You might want to
> try something else to parse this, like Scrapy or Beautiful Soup.
>
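
To illustrate the point quoted above, here is a minimal sketch using only the standard library. The markup in bad_markup is a made-up fragment (not the original poster's page) with a title tag closed by a non-matching author tag, and parses_as_xml is a hypothetical helper name. A strict XML parser rejects such input outright instead of recovering from it:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <title> is closed by </author>, so it is not well-formed XML.
bad_markup = "<book><title>Django</author></book>"
good_markup = "<book><title>Django</title></book>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for mismatched or unclosed tags, among other well-formedness errors.
        return False


print(parses_as_xml(bad_markup))
print(parses_as_xml(good_markup))
```

This is why an HTML-aware parser is the safer choice for pages you do not control.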

For parsing arbitrary HTML, I find that the combination of html5lib
and lxml is hard to beat. html5lib tolerates malformed markup, and
lxml gives you XPath on the resulting tree:

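import html5lib
from html5lib import treebuilders

# html_str is the page source as a string, fetched elsewhere.
# Build an html5lib parser that produces an lxml tree.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder('lxml'))
doc = parser.parse(html_str)

# html5lib puts elements in the XHTML namespace by default, so XPath needs a prefix.
ns = {'h': 'http://www.w3.org/1999/xhtml'}
li_tables = doc.xpath('//h:ul[@class="table_list"]', namespaces=ns)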
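
As a self-contained variant of the snippet above, the following sketch runs the same parser on a small made-up input (html_str here is an assumption, not the original poster's page) that has an unclosed p tag and unclosed li tags. It assumes html5lib and lxml are both installed, and pulls the text of each list item out of the matched list:

```python
import html5lib
from html5lib import treebuilders

# Hypothetical malformed input: <p> and both <li> elements are never closed.
html_str = '<div><p>intro<ul class="table_list"><li>first<li>second</ul></div>'

# Same parser setup as in the snippet above: html5lib parsing into an lxml tree.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder('lxml'))
doc = parser.parse(html_str)

# Elements live in the XHTML namespace, so every step of the XPath needs the prefix.
ns = {'h': 'http://www.w3.org/1999/xhtml'}
items = doc.xpath('//h:ul[@class="table_list"]/h:li/text()', namespaces=ns)

for text in items:
    print(text)
```

The parser repairs the missing closing tags the way a browser would, so the XPath query sees a clean tree with two separate list items.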

Cheers

Tom

