Saturday, October 2, 2010

Re: how to retrieve the body text alone of a webpage

Thanks guys,
I tried this:

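from BeautifulSoup import BeautifulSoup
import time
import urllib

def get_page_body_text(url):
    # Fetch the page and parse it with BeautifulSoup.
    h = urllib.urlopen(url)
    data = h.read()
    soup = BeautifulSoup(data)
    # Collect every text node under <body> and join them into one string.
    body_texts = soup.body(text=True)
    text = ''.join(body_texts)
    return text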

...
while True:
    #print 'size=%d' % len(get_page_body_text('http://www.google.com'))
    print 'size=%d' % len(get_page_body_text('http://sampleblogbyjim.blogspot.com/'))
    time.sleep(5)

When google.com is the URL, the code gets the correct length of
data. Then I tried a blog which I created for fun.
This causes the code to crash with an error:


  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>",

Any idea how this can be taken care of? The blog site must be creating
bad HTML. How do you deal with such a problem?
Thanks,
jim
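
One possible way around this kind of error, offered only as a rough sketch: the fragment in the traceback looks like a closing script tag split inside a JavaScript string, so a parser that treats script contents as plain data and skips them should not trip over it. The version below assumes Python 3 and its standard-library html.parser instead of the BeautifulSoup and Python 2.6 setup above. The names BodyTextExtractor and extract_body_text are made up for illustration, and decoding the page as UTF-8 is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class BodyTextExtractor(HTMLParser):
    """Collect text nodes inside <body>, ignoring <script> and <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False   # True while between <body> and </body>
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.chunks = []       # text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible body text; script and style contents are dropped.
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def extract_body_text(html):
    """Return the concatenated body text of an HTML string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def get_page_body_text(url):
    """Fetch a page and return its body text (assumes UTF-8 content)."""
    raw = urlopen(url).read()
    return extract_body_text(raw.decode("utf-8", errors="replace"))
```

Calling get_page_body_text(url) in the same loop as before would then print the length of the visible text only. Dropping script and style contents is a design choice here, so the sizes will not match what soup.body(text=True) reports for the same page.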

--
You received this message because you are subscribed to the Google Groups "Django users" group.
