On Wed, Nov 2, 2011 at 10:46 AM, Tom Evans <tevans.uk@googlemail.com> wrote:
> On Wed, Nov 2, 2011 at 5:30 PM, Ian Clelland <clelland@gmail.com> wrote:
>> On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler <hv@tbz-pariv.de> wrote:
>>>
>>> # This is my current solution
>>> if get_special_objects().filter(pk=obj.pk).count():
>>>     # yes, it is special
>>>
>> I can't speak to the "why" of this situation; it seems to me that this
>> could always be converted into a more efficient database query without
>> any unexpected side-effects (and if I really wanted the side effects, I
>> would just write "if obj in list(qs)" instead). In this case, though, I
>> would usually write something like this:
>>
>>     if get_special_objects().filter(pk=obj.pk).exists():
>>         # yes, it is special
>>
>> I believe that in some cases, the exists() query can be optimized to
>> return faster than a count() aggregation, and I think that the intent
>> of the code appears more clearly.
>>
>> Ian
>
> OK, take this example. I have a Django model table with 70 million rows
> in it. Doing any kind of query on this table is slow, and typically the
> query is date-constrained - which MySQL will use as the optimum key,
> meaning any further filtering is a table scan on the filtered rows.
>
> Pulling a large query (say, all logins in a month, ~1 million rows)
> takes only a few seconds longer than counting the number of rows the
> query would find - after all, the database still has to do precisely
> the same amount of work; it just doesn't have to deliver the data.
>
> Say I have n entries that I want to test are in that result set, and I
> also want to iterate through the list, calculating some data and
> printing out each row. I can do the existence tests either in Python or
> in the database. If I do it in the database, I have n+1 expensive
> queries to perform. If I do it in Python, I have 1 expensive query to
> perform and (worst case) n+1 full scans of the data retrieved (and I
> avoid locking the table for n+1 expensive queries).
>
> Depending on the size of the data set, as the developer I have the
> choice of which will be more appropriate for my needs. Sometimes I need
> "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in qs".
I agree that there are situations where you want, or need, to pull the data into Python for processing, to avoid a lot of database overhead. That's why we have select_related as well: sometimes you really do need to grab as much as possible all at once.
The trouble is that querysets are *supposed* to be lazy, evaluating only as much as necessary, as late as possible, to do the job. I think that this behaviour violates the principle of least surprise, by instantiating a (potentially very large) queryset as a side-effect of a simple inclusion test.
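As a quick illustration of that laziness (Entry here is a hypothetical
model, used purely for illustration):

    # No SQL has been executed yet -- each filter just builds a new queryset.
    qs = Entry.objects.filter(status='special')
    qs = qs.exclude(hidden=True)

    # The query runs here, the first time the results are actually needed.
    for e in qs:
        print(e.pk)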
Any other time that you want a queryset instantiated, the idiomatic way to do so is to construct a Python list from it:
    # Get all objects at once from the database
    objs = list(qs)
    # Now use that list multiple times in a method

or

    for obj in list(qs):
        # qs is evaluated once; list members may be manipulated
        # as needed in Python

or, by extension,

    if obj in list(qs):
        # stuff
I wouldn't rely on the behaviour of the "in" operator to evaluate the queryset for me; it doesn't look right to me, it's not obvious to anyone else looking at the code, and I don't think it's documented behaviour.
I would prefer that "in" did an exists() query, but since there are explicit ways to force either behaviour, in practice I use one of them rather than leave the code looking ambiguous.
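A minimal sketch of those two explicit forms, assuming qs is any queryset
and obj a saved model instance (do_something is a hypothetical handler,
named only for illustration):

    # Push the test to the database: a single cheap existence check
    # (typically SELECT ... LIMIT 1, rather than a full COUNT(*)).
    if qs.filter(pk=obj.pk).exists():
        do_something(obj)

    # Evaluate once up front: one big query, then pure-Python tests.
    members = list(qs)
    if obj in members:    # compares model class and primary key; no SQL
        do_something(obj)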
Regards,
Ian Clelland
<clelland@gmail.com>