Wednesday, February 10, 2016

Re: WHERE EXISTS / WHERE NOT EXISTS clause without using QuerySet.extra

I posted a ticket here: https://code.djangoproject.com/ticket/26194 which was marked as a duplicate of https://code.djangoproject.com/ticket/25789#comment:2

It was suggested that I try Parent.objects.filter(child=None).distinct() (but that it might perform worse than the WHERE NOT EXISTS query)

Here's the plan generated by that query:

HashAggregate  (cost=216.11..216.54 rows=43 width=103) (actual time=1.553..1.562 rows=25 loops=1)
  ->  Nested Loop Anti Join  (cost=0.42..215.15 rows=43 width=103) (actual time=0.015..1.511 rows=25 loops=1)
        ->  Seq Scan on catalogue_category  (cost=0.00..104.45 rows=145 width=103) (actual time=0.006..0.154 rows=146 loops=1)
        ->  Index Only Scan using catalogue_productcategory_b583a629 on catalogue_productcategory  (cost=0.42..37.27 rows=801 width=4) (actual time=0.009..0.009 rows=1 loops=146)
              Index Cond: (category_id = catalogue_category.id)
              Heap Fetches: 87
Total runtime: 1.619 ms

This is only 3 times as slow as the carefully optimized query, which might be an acceptable performance trade-off in some cases.

On Wednesday, February 3, 2016 at 10:03:31 AM UTC-6, John P. wrote:
Greetings!

I'm trying to refactor a query to avoid using QuerySet.extra (as per the recommendation here: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#extra)

Here's a simplified version of my code:

# testapp.models

class Parent(models.Model):
    pass

class Child(models.Model):
    parent = models.ForeignKey('testapp.Parent')

This is the query I am attempting to replace:

Parent.objects.extra(where=["""
    NOT EXISTS (SELECT 1 FROM testapp_child WHERE testapp_child.parent_id = testapp_parent.id)
"""])

This is extremely similar to, but not the same as

Parent.objects.exclude(id__in=Child.objects.all().values('parent_id'))

Both queries should return the same rows, but the biggest difference is that PostgreSQL's query planner produces completely different plans. The performance difference can be huge, even for a relatively modest data set. (I believe my data set has something like 270k "parent" instances and 80k "child" instances.)

I tried searching this group as well as the ticket system, and couldn't find a solution to this exact problem (other than using the alternative query).

Thanks for taking the time to read through all of this! I am more than happy to open a ticket, but I thought I should post here first.


More Technical Stuff

Here's the output of EXPLAIN ANALYZE on my actual models/data. I had to apply a LIMIT 100 because if I try to return the entire result, the second one completely hangs my computer. I bolded some relevant bits


Limit  (cost=0.71..9.70 rows=100 width=111) (actual time=0.074..0.483 rows=100 loops=1)
  ->  Merge Anti Join  (cost=0.71..22435.69 rows=249613 width=111) (actual time=0.072..0.465 rows=100 loops=1)
        Merge Cond: (catalogue_product.id = catalogue_productcategory.product_id)"
        ->  Index Scan using catalogue_product_pkey on catalogue_product  (cost=0.42..16986.70 rows=292339 width=111) (actual time=0.020..0.285 rows=150 loops=1)
        ->  Index Only Scan using catalogue_productcategory_9bea82de on catalogue_productcategory  (cost=0.29..3861.27 rows=81671 width=4) (actual time=0.037..0.070 rows=101 loops=1)
              Heap Fetches: 0
Total runtime: 0.568 ms

Here's the plan for the second query:
Limit  (cost=0.00..234229.95 rows=100 width=111) (actual time=31.087..1165.022 rows=100 loops=1)
  ->  Seq Scan on catalogue_product  (cost=0.00..342373919.34 rows=146170 width=111) (actual time=31.087..1164.992 rows=100 loops=1)
        Filter: (NOT (SubPlan 1))
        Rows Removed by Filter: 5
        SubPlan 1
          ->  Materialize  (cost=0.00..2138.07 rows=81671 width=4) (actual time=0.001..5.539 rows=80818 loops=105)
                ->  Seq Scan on catalogue_productcategory u0  (cost=0.00..1409.71 rows=81671 width=4) (actual time=0.005..10.574 rows=81671 loops=1)
Total runtime: 1165.255 ms

As you can see the former is about ~2000 times faster than the latter.

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/ec4ca1dd-0136-42c4-a949-9db1145b47e4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

No comments:

Post a Comment