Sunday, January 25, 2015

Find duplicates with regex

I posted this question to stackover flow but didn't get a good answer: http://stackoverflow.com/questions/28080545/django-find-duplicates-with-queryset-and-regex

I want to find duplicates in db fields with a regex.

I know I can use this to find duplicates:
self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I have a model like this:
class company(models.Model):
   Website = models.URLField(blank=True, null=True )

The problem is that www and non-www websites are marked as different websites.  I want some thing that will return duplicates where it realizes www and non-www are the same website.

I know I can use a regex like this for www and non-www:
Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Here is an example:
Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')

When I call: Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

It returns:
1.  http://example.org (from name=a) and http://example.org (from name=b)

This is missing that example.com and www.example.com are the same website and duplicates.

I want to use a regex so that I can tell django that example.com and www.example.com are the same websites.

I want the return to be:
1.  http://example.org (from name=a) and http://example.org (from name=b)
2.  example.com www.example.com
   

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/452fad73-1319-4954-b004-7d0604705f30%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

No comments:

Post a Comment