Saturday, October 2, 2010

Re: how to retrieve the body text alone of a webpage

You can use BeautifulSoup to parse the page. Parsing gives you a BeautifulSoup object from which you can get the text of any element, including the body.

Simple example:

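import BeautifulSoup  # BeautifulSoup 3.x module

# html_string is assumed to already hold the page source as a string
soup = BeautifulSoup.BeautifulSoup(html_string)

# find a div with class 'header3' containing the text 'Locations'
location_header = soup.find('div', attrs={'class': 'header3'}, text='Locations')

# possibly suiting your purposes: grab the <body> element
body = soup.find('body')

# length of the text inside the body
body_length = len(body.text)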
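
If you would rather not install BeautifulSoup, here is a rough sketch of the same idea using only the standard library's html.parser (Python 3). This is a swapped-in technique, not BeautifulSoup itself. The names BodyTextExtractor and body_text are made up for this example, and skipping script and style content is my own assumption about what "body text" should mean.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect the text found inside <body>, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: their contents are not "body text"

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False   # True while we are between <body> and </body>
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.chunks = []       # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def body_text(html_string):
    """Return the body text of html_string as one space-joined string."""
    parser = BodyTextExtractor()
    parser.feed(html_string)
    parser.close()
    return " ".join(parser.chunks)
```

Calling body_text(html_string) returns a plain string, so len(body_text(html_string)) plays the same role as body_length above. It is less forgiving of badly broken markup than BeautifulSoup, so treat it as a fallback.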


Shawn

