Sunday, November 3, 2013

RE: Complex query reduction

Hi

You are running into typical production Django problems - works fine in testing, terrible in production.

Firstly

If you use attributed many-to-manys, create the table outside of Django and use raw SQL.
Create a primary key on the two things being related (usually larger table first).

Multi-column indices
Two choices
  • A single index that includes both (do this outside Django)
  • Two indices, one for each column
The size will be roughly equivalent between the two

Single composite index
smaller than two indexes.
faster to update.
accelerates a more restricted range of queries.
  • An index on (A, B) does nothing for queries just using B
  • But on the queries it does accelerate, it will be faster.

Two indices
  • larger
  • slower to update
  • will help on queries on both columns, but not as much as a single composite index
  • will help accelerate queries on either of the columns
  • can be expressed directly in Django

Helpful Tool
pg_stat_activity shows a wealth of useful information - including

which indexes are actually being used

If an index is not being used (or not being used very often) drop it
if you are surprised that it is not being used, find out why

Basic Performance Rules
1. Do not iterate over QuerySets
2. If you think you have to iterate over a QuerySet, see rule #1
3. If you are absolutely, certainly, 100% positive that the only possible solution to your problem is iterating over a QuerySet, see rule #3

Iteration - A Health Warning
• Ignores Django's lazy-evaluation mechanism and copies everything into local memory
• Copies data from the database just to process it locally
• Does filtration or summation in the application that is processed more efficiently in the database.
• Databases are good at this sort of stuff let the DB do it

Alternatives to Iteration
• QuerySet.update()
• cursor.execute("UPDATE reader_hours ...")
• Stored procedures

Take in the code below

How much many objects are in memory at point A?

qs = Orders.objects.all()
 # There are about 2,500,000 rows in "orders"

for order in qs:
order.age_in_days += 1 # POINT A
 order.save()

Answer  - 2,500,000

Why Is This?
  • Django does lazy evaluation… (everyone tells me so!)
  • The Django code carefully asks for a slice of 100 objects
  • which trickles down through lots of really convoluted Python to psycopg2
  • which dutifully asks for 100 rows from Postgres
  • which sends all 2,500,000 over the wire

Solution - Named Cursors
The protocol between the Postgres client and server only does partial sends when using named cursors
  • psycopg2 fully supports named cursors
  • Django doesn't use named cursors
  • Therefore, the first time you ask for any object in a QuerySet, you get all of them
  • This is a very good reason not to ask for large result sets

IN
  • if using Django avoid the IN operation at all costs

If there are potentially more than 15 items in the list, rework the IN as a JOIN against whatever the source of the keys is


At some stage you are going to have to get to grips with SQL, might as well be sooner than later

R+C

PS some time ago one of my clients was only able to process 4 transactons a second against their very complicated DB. They needed to process 70+ transactions per second. After 3 days of code analysis we had the transaction rate up to 240 per second and so could get rid of the 6 week backlog that had developed. Front end coders didn't consider the DBMS to be important

Date: Sun, 3 Nov 2013 02:37:09 -0800
From: akaariai@gmail.com
To: django-users@googlegroups.com
Subject: Re: Complex query reduction

You should rewrite the query into a form that doesn't require distinct. In general, when you see a query that has joins and DISTINCT, that should be an alarm bell that something isn't written correctly in the query. Unfortunately Django's ORM generates such queries, and that isn't easy to fix as there are databases that like DISTINCT more than rewriting the query to use subqueries.

In any case, you should check how to write the query without the need of distinct. Something like this should work:

self.items = BibliographicRecord.objects.
listable_objects().filter(
        authored__researcher__in=researchers,
        )
self.items = BibliographicRecord.objects.
listable_objects().filter(
        pk__in=self.items.values_list('pk')
        )

But maybe you can push the __in to deeper into the authored__researches lookup...

 - Anssi

On Saturday, November 2, 2013 4:50:42 PM UTC+2, Daniele Procida wrote:
On Fri, Nov 1, 2013, Javier Guerra Giraldez <jav...@guerrag.com> wrote:

>have you tried eliminating the second IN relationship?  something like
>
>entities = entity.get_descendants()
>
>items = BibliographicRecord.objects.filter
>(authored__researcher__person__member_of__entity__in=entities).distinct()

Indeed I have, but in that form it takes around 1770ms, compared to around 1540ms in the original form. What I actually do is:

    # breaking apart the queries allows the use of values_lists
    entities = self.entity.get_descendants(
        include_self=True
        ).values_list('id', flat=True)

    # and the set() here is about 230ms faster than putting a distinct() on
    # the first query
    researchers = set(Researcher.objects.filter(
        person__entities__in=entities
        ).values_list('person', flat=True))

    self.items = BibliographicRecord.objects.listable_objects().filter(
        authored__researcher__in=researchers,
        ).distinct()

I think that's partly because this way the SELECT doesn't have to grab all the fields of publications_bibliographicrecord.

But, the real killer is the combination of ordering (in the queryset or on the model, it doesn't matter) with the distinct() - as soon as one is removed from the equation, the execution time drops to around 250ms.

That's for 55000 BibliographicRecords created by that last operation (before distinct() is applied; distinct() reduces them to 28000).

That seems excessive to me.

BibliographicRecord has a custom primary key, and its id fields look like "d9ce7e2f-663e-4fc6-8448-b214c6915aed:web-of-science". Could that be implicated in performance?

Daniele


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/8208cc22-cc94-4fe6-9245-709bdc42647f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

No comments:

Post a Comment