Sunday, November 3, 2013

Re: Complex query reduction

You should rewrite the query into a form that doesn't require distinct. In general, when you see a query that has joins and DISTINCT, that should be an alarm bell that something isn't written correctly in the query. Unfortunately Django's ORM generates such queries, and that isn't easy to fix as there are databases that like DISTINCT more than rewriting the query to use subqueries.

In any case, you should check how to write the query without the need of distinct. Something like this should work:

self.items = BibliographicRecord.objects.
listable_objects().filter(
        authored__researcher__in=researchers,
        )
self.items = BibliographicRecord.objects.
listable_objects().filter(
        pk__in=self.items.values_list('pk')
        )

But maybe you can push the __in to deeper into the authored__researches lookup...

 - Anssi

On Saturday, November 2, 2013 4:50:42 PM UTC+2, Daniele Procida wrote:
On Fri, Nov 1, 2013, Javier Guerra Giraldez <jav...@guerrag.com> wrote:

>have you tried eliminating the second IN relationship?  something like
>
>entities = entity.get_descendants()
>
>items = BibliographicRecord.objects.filter
>(authored__researcher__person__member_of__entity__in=entities).distinct()

Indeed I have, but in that form it takes around 1770ms, compared to around 1540ms in the original form. What I actually do is:

    # breaking apart the queries allows the use of values_lists
    entities = self.entity.get_descendants(
        include_self=True
        ).values_list('id', flat=True)

    # and the set() here is about 230ms faster than putting a distinct() on
    # the first query
    researchers = set(Researcher.objects.filter(
        person__entities__in=entities
        ).values_list('person', flat=True))

    self.items = BibliographicRecord.objects.listable_objects().filter(
        authored__researcher__in=researchers,
        ).distinct()

I think that's partly because this way the SELECT doesn't have to grab all the fields of publications_bibliographicrecord.

But, the real killer is the combination of ordering (in the queryset or on the model, it doesn't matter) with the distinct() - as soon as one is removed from the equation, the execution time drops to around 250ms.

That's for 55000 BibliographicRecords created by that last operation (before distinct() is applied; distinct() reduces them to 28000).

That seems excessive to me.

BibliographicRecord has a custom primary key, and its id fields look like "d9ce7e2f-663e-4fc6-8448-b214c6915aed:web-of-science". Could that be implicated in performance?

Daniele

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/8208cc22-cc94-4fe6-9245-709bdc42647f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

No comments:

Post a Comment