Tuesday, September 28, 2010

Re: notification in python

Harryos,

On 28/09/2010 at 09:56, harryos wrote:

> Thanks Erik,
> By 'update' I meant a major addition or removal of text (say, 100
> characters).
> Initially I thought of making a hash of a page and comparing it to the
> saved hash of the same page at a different moment in time. But this
> would cause even a tiny change to be considered an update. I would
> like to use a filter that defines an update as a change of x characters.
> Maybe using f=urllib.urlopen and
> currentsize=len(f.read()) will let me find the number of added/
> removed characters, and I can set the filter accordingly.


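Content length (which you could also get from the HTTP header "Content-Length") won't necessarily tell you whether the content has changed: an edit that replaces text with other text of the same length leaves the size untouched. I think your problem is a candidate for Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance), which calculates the "distance" between two strings as the number of single-character insertions, deletions and substitutions needed to turn one into the other. I believe there are Python implementations available.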
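
To make the idea concrete, here is a minimal sketch of the edit-distance check in plain Python 3. The function names and the threshold of 100 are only my assumptions, based on the figure you mentioned; this is not taken from any particular library.

```python
def levenshtein(a, b):
    """Return the edit distance between the strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    # previous[j] = distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]


def is_major_update(old_text, new_text, threshold=100):
    """Hypothetical filter: treat a change of `threshold` edits or more as an update."""
    return levenshtein(old_text, new_text) >= threshold
```

Note that this takes time proportional to len(a) * len(b), so for whole pages it may be slow; comparing only the interesting part of the page keeps it manageable.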

Depending on your requirements, you could add other heuristics to detect major changes, e.g. load the page into an HTML or XML parser and only check certain <div> elements. But further suggestions would require more information about your problem.
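
As a rough illustration of the parser idea, the sketch below uses the standard library's html.parser (Python 3) to collect only the text inside one <div>. I am assuming the part you care about can be identified by an id attribute; the class name, the helper and the id "content" are made up for the example.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside the <div> with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0  # <div> nesting depth inside the target; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested <div> inside the target
        elif dict(attrs).get("id") == self.div_id:
            self.depth = 1  # entering the target <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html, div_id):
    """Hypothetical helper: return the text of the <div> whose id is div_id."""
    parser = DivTextExtractor(div_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

You would then compare div_text(old_page, "content") with div_text(new_page, "content") instead of the full pages, so that changes in ads or navigation do not count as updates.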

Kind regards,

Erik
