Sunday, January 25, 2015

Re: Find duplicates with regex

The benefit to doing stuff like this in the database is that the
database can do it smarter than you can in Python. Is that true in
this case? I think not. If we were talking raw MySQL I'd suggest
running your Website column through something like REPLACE(Website,
'://www.', '') and then the count would work reasonably well.

But we're not talking raw MySQL. And if you want to use QuerySet I
assume you're still trying to stay database agnostic. That breaks it
down for me into two cases:

1. This is a query you'll be frequently and must run quickly. In that
case, I'd say to alter your model to include a normalized_website
field that is calculated from Website. Strip the 'www.' either in an
overridden save method or a pre_save signal. And don't use regexp for
this work; urlparse and urlunparse are much more appropriate (never
use a regexp to parse structured data when you can help it). Write a
data migration to handle the existing table. Then you can run your
query over normalized_website and things will work fine.

2. This is a query you'll run infrequently and doesn't need to run
quickly. In that case, just pull out all the records and do the
count/filter with Python code.

On Sun, Jan 25, 2015 at 10:58 PM, <yakkadesign@gmail.com> wrote:
> I posted this question to stackover flow but didn't get a good answer:
> http://stackoverflow.com/questions/28080545/django-find-duplicates-with-queryset-and-regex
>
> I want to find duplicates in db fields with a regex.
>
> I know I can use this to find duplicates:
> self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
>
> I have a model like this:
> class company(models.Model):
> Website = models.URLField(blank=True, null=True )
>
> The problem is that www and non-www websites are marked as different
> websites. I want some thing that will return duplicates where it realizes
> www and non-www are the same website.
>
> I know I can use a regex like this for www and non-www:
> Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
>
> Here is an example:
> Company.objects.create(Website='http://example.com')
> Company.objects.create(Website='http://www.example.com')
> Company.objects.create(Website='http://example.org', Name='a')
> Company.objects.create(Website='http://example.org', Name='b')
>
> When I call:
> Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
>
> It returns:
> 1. http://example.org (from name=a) and http://example.org (from name=b)
>
> This is missing that example.com and www.example.com are the same website
> and duplicates.
>
> I want to use a regex so that I can tell django that example.com and
> www.example.com are the same websites.
>
> I want the return to be:
> 1. http://example.org (from name=a) and http://example.org (from name=b)
> 2. example.com www.example.com
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-users+unsubscribe@googlegroups.com.
> To post to this group, send email to django-users@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-users/452fad73-1319-4954-b004-7d0604705f30%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAD4ANxVaKtLC%2BEBFRibMd%3D8M1nU56RQi0Brcoiy9Vs4ErTtNjg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

No comments:

Post a Comment