Wednesday, November 2, 2011

Re: QuerySet: "if obj in queryset" can be very slow



On Wed, Nov 2, 2011 at 10:46 AM, Tom Evans <tevans.uk@googlemail.com> wrote:
On Wed, Nov 2, 2011 at 5:30 PM, Ian Clelland <clelland@gmail.com> wrote:
> On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler <hv@tbz-pariv.de> wrote:
>>
>> # This is my current solution
>> if get_special_objects().filter(pk=obj.pk).count():
>>    # yes, it is special
>>
>
> I can't speak to the "why" of this situation; it seems to me that this could
> always be converted into a more efficient database query without any
> unexpected side-effects (and if I really wanted the side effects, I would
> just write "if obj in list(qs)" instead). In this case, though, I would
> usually write something like this:
> if get_special_objects().filter(pk=obj.pk).exists():
>    # yes, it is special
> I believe that in some cases, the exists() query can be optimized to return
> faster than a count() aggregation, and I think that the intent of the code
> appears more clearly.
> Ian
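
The short-circuiting advantage Ian describes can be sketched in plain Python (an analogy only, not Django's actual SQL): any() stops at the first match, the way an EXISTS query can stop at the first matching row, while counting must touch every element, the way COUNT(*) scans them all.

```python
# Analogy for exists() vs count(): any() short-circuits, counting does not.
def found_by_count(rows, target):
    # Like .count(): tallies every matching row, then compares to zero.
    return sum(1 for r in rows if r == target) > 0

def found_by_exists(rows, target):
    # Like .exists(): stops as soon as one match is seen.
    return any(r == target for r in rows)

rows = iter([1, 2, 3, 4, 5])
# found_by_exists consumes only up to the first match...
print(found_by_exists(rows, 2))   # True
# ...leaving the rest of the iterator untouched, proving it stopped early.
print(list(rows))                 # [3, 4, 5]
```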

OK, take this example. I have a django model table with 70 million
rows in it. Doing any kind of query on this table is slow, and
typically the query is date-constrained - which MySQL will use as the
optimum key, meaning any further filtering is a table scan on the
rows matched by that key.

Pulling a large query (say, all logins in a month, ~1 million rows)
takes only a few seconds longer than counting the number of rows the
query would find - after all, the database still has to do precisely
the same amount of work, it just doesn't have to deliver the data.

Say I have n entries I want to test for membership in that result
set, and I also want to iterate through the list, calculating some
data and printing out each row. I can do the existence tests either in
python or in the database. If I do it in the database, I have n+1
expensive queries to perform. If I do it in python, I have 1 expensive
query to perform, and (worst case) n+1 full scans of the data
retrieved (and I avoid locking the table for n+1 expensive queries).
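
Tom's trade-off can be made concrete with a toy cost model (all names here are hypothetical; no database or Django is involved): probing each of n items with its own query costs n round trips, while fetching the result set once and testing membership in Python costs exactly one.

```python
# Toy cost model for the "n queries vs one query" trade-off.
query_count = 0

def expensive_query(pk=None):
    """Stand-in for a slow database query; counts every invocation."""
    global query_count
    query_count += 1
    table = {1, 2, 3, 5, 8, 13}
    if pk is None:
        return set(table)      # fetch the full result set
    return pk in table         # per-item existence test

candidates = [1, 4, 8, 21]

# Strategy A: one existence query per candidate -> len(candidates) queries.
in_db = [pk for pk in candidates if expensive_query(pk)]
queries_a = query_count

# Strategy B: one query, then cheap in-memory membership tests.
result_set = expensive_query()
in_mem = [pk for pk in candidates if pk in result_set]
queries_b = query_count - queries_a

print(in_db, queries_a)    # [1, 8] 4
print(in_mem, queries_b)   # [1, 8] 1
```

Which strategy wins depends, as Tom says, on how expensive each query is and how large the fetched result set would be.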

Depending on the size of the data set, as the developer I have the
choice of which will be more appropriate for my needs. Sometimes I
need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
qs".

I agree that there are situations where you want, or need, to pull the data in to Python for processing, to avoid a lot of database overhead. That's why we have select_related, as well: sometimes you really do need to just grab as much as possible all at once. 

The trouble is that querysets are *supposed* to be lazy; evaluating only as much as necessary, as late as possible, to do the job. I think that this behaviour violates the principle of least surprise, by instantiating a (potentially very large) queryset as a side effect of a simple inclusion test.

Any other time that you want a queryset instantiated, the idiomatic way to do it is to construct a Python list based on it:

# Get all objects at once from database
objs = list(qs)
# Now use that list multiple times in a method

or

for obj in list(qs):
  # qs is evaluated once, list members may be manipulated as needed in Python

or, by extension,

if obj in list(qs):
  # stuff
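
The idiom generalizes beyond Django: materialize a lazy sequence once, then reuse the concrete list. A plain-Python sketch, with a generator standing in for the lazy queryset:

```python
evaluations = 0

def lazy_results():
    """Generator standing in for a lazy queryset; counts evaluations."""
    global evaluations
    evaluations += 1
    yield from ["alice", "bob", "carol"]

# Evaluate once, up front - exactly the objs = list(qs) idiom.
objs = list(lazy_results())

# The concrete list can now be reused any number of times
# without triggering another evaluation.
assert "bob" in objs
assert "dave" not in objs
for name in objs:
    pass  # iterate freely; no re-evaluation happens

print(evaluations)  # 1
```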

I wouldn't rely on the behaviour of the "in" operator to evaluate the queryset for me; it doesn't look right to me, it's not obvious to anyone else reading the code, and I don't think it's documented behaviour.

I would prefer that "in" did an exists() query, but since there are explicit ways to force either behaviour, in practice I use one of those explicit ways rather than leave the code looking ambiguous.

--
Regards,
Ian Clelland
<clelland@gmail.com>

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to django-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.
