Friday, May 30, 2014

Suggestions for Paginating a Growing Django Queryset

Hello everyone!
 
I've been stuck for about a week trying to figure this out. I've read all sorts of blogs and asked questions on Stack Overflow in the hope that something would point me in the right direction. I would like to know what you think I could do.
 
Let's say I have 4 models: a User model, a Blog model, a Hashtag model and a Comment model. Users can create blogs, users can comment on them, and blogs can be tagged by the blog owner from the comments or from the blog description itself. If I were to best describe this, think of Instagram: you can hashtag your photo when you're uploading it, and afterwards by commenting on it.
 
Here are the models:
 
#models.py
 
from django.db import models
from django.contrib.auth.models import User
 
class Blog(models.Model):
    user = models.ForeignKey(User)
    description = models.CharField(max_length=140, blank=True)
    blog_text = models.CharField(max_length=5000, blank=True)
    related_hashtags = models.ManyToManyField('Hashtag', related_name='tagged_post', blank=True)


class Comment(models.Model):
    user = models.ForeignKey(User)
    blog = models.ForeignKey('Blog', related_name='comments')  # the blog being commented on
    comment = models.CharField(max_length=140, blank=True)


class Hashtag(models.Model):
    user = models.ForeignKey(User)
    hashtag = models.CharField(max_length=140, blank=True)  # the tag text, e.g. 'acdc'
    count = models.IntegerField(default=1)  # how many blogs/comments have used this tag
 
Pretty simple. A blog can have multiple hashtags and a hashtag can belong to multiple blogs. When a user creates a blog, I check the description string for any hashtags with the snippet below:
 
import re

# each blog can have a maximum of 30 hashtags
hash_tags = re.findall(r'#(\w+)', blog.description, re.UNICODE)[:30]
for hash_tag in hash_tags:
    hash_tag = hash_tag.lower()  # tags are stored lowercase
    try:
        old_hashtag = Hashtag.objects.get(hashtag=hash_tag)
        old_hashtag.count = old_hashtag.count + 1
        old_hashtag.save()
        blog.related_hashtags.add(old_hashtag)
    except Hashtag.DoesNotExist:
        # Hashtag.user is required, so the blog owner becomes the tag's creator
        new_hashtag = Hashtag.objects.create(user=blog.user, hashtag=hash_tag)
        blog.related_hashtags.add(new_hashtag)

I hope everything makes sense so far. I do the same as above whenever a Comment object is created.

Now let's imagine the hashtag #acdc is really popular: 20-40 blogs are added to that hashtag roughly every millisecond.

If the user searches for #acdc, I would like to show the latest 20 blogs that have just been added with #acdc, and as the user loads more, I'll show the next 20. They are ordered in DESC order of insertion.

I had to resort to raw SQL, since Django couldn't order the queryset by the pk of the ManyToManyField's join table. Many people suggested using a through table, but for legacy reasons I did it this way:

SELECT * FROM django_blog
INNER JOIN django_blog_related_hashtag
    ON (django_blog.id = django_blog_related_hashtag.blog_id)
INNER JOIN django_hashtag
    ON (django_blog_related_hashtag.hashtag_id = django_hashtag.id)
WHERE (django_hashtag.hashtag = 'acdc')
ORDER BY django_blog_related_hashtag.id DESC
LIMIT 20

If the user wanted to view page two, I'd simply add an OFFSET at the end of the query, calculated like so:

OFFSET = (page_number -1) * 20
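
Put together, the view code for a page looks roughly like this. It's just a sketch: blogs_for_tag_page is a name I made up, page numbers start at 1, and it assumes the Blog model from above is imported.

PAGE_SIZE = 20

def blogs_for_tag_page(tag, page_number):
    """One page of blogs tagged with `tag`, newest tagging first (pages are 1-based)."""
    offset = (page_number - 1) * PAGE_SIZE
    query = """
        SELECT django_blog.* FROM django_blog
        INNER JOIN django_blog_related_hashtag
            ON (django_blog.id = django_blog_related_hashtag.blog_id)
        INNER JOIN django_hashtag
            ON (django_blog_related_hashtag.hashtag_id = django_hashtag.id)
        WHERE django_hashtag.hashtag = %s
        ORDER BY django_blog_related_hashtag.id DESC
        LIMIT %s OFFSET %s
    """
    return list(Blog.objects.raw(query, [tag, PAGE_SIZE, offset]))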

This all works great, but for web apps like Instagram or Twitter or this blog app I just made, where the queryset is constantly growing, the user is not always in sync with the most recent data. That's perfectly fine; I don't care about that. What I do care about is that the next page they request must contain the results that are supposed to come after the ones they have just received. If I don't account for this, new inserts shift the OFFSET and users might see blogs they have already seen on previous pages. Someone mentioned caching the entire queryset, but wouldn't that have a large memory overhead?
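
To make that requirement concrete, what I think I want is for every page after the first to continue from wherever the previous page left off, instead of recomputing an OFFSET against a moving target. Something like this rough variation on the helper above, where before_id (the django_blog_related_hashtag.id of the last row the client has already seen) is just an idea I'm toying with, not something I have working:

PAGE_SIZE = 20

def blogs_for_tag(tag, before_id=None):
    """One page of blogs for `tag`, newest tagging first.

    `before_id` is the django_blog_related_hashtag.id of the last row the
    client already has; later pages continue strictly below it, so rows
    inserted in the meantime can't push already-seen blogs onto the next page.
    """
    query = """
        SELECT django_blog.*, django_blog_related_hashtag.id AS tagging_id
        FROM django_blog
        INNER JOIN django_blog_related_hashtag
            ON (django_blog.id = django_blog_related_hashtag.blog_id)
        INNER JOIN django_hashtag
            ON (django_blog_related_hashtag.hashtag_id = django_hashtag.id)
        WHERE django_hashtag.hashtag = %s
    """
    params = [tag]
    if before_id is not None:
        query += " AND django_blog_related_hashtag.id < %s"
        params.append(before_id)
    query += " ORDER BY django_blog_related_hashtag.id DESC LIMIT %s"
    params.append(PAGE_SIZE)
    return list(Blog.objects.raw(query, params))

The client would remember the tagging_id of the last blog on the page it just received and send it back as before_id when it asks for the next page.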

I'm looking forward to hearing your thoughts and suggestions.

Kind Regards,

Riegie Godwin

