Saturday, May 27, 2017

Should I store offline calculation results in the cache?

Hello all,

I have an application that calculates whether a specific crop on a specific piece of land needs to be irrigated, and by how much. The calculation takes a few seconds, so I'm doing it offline with Celery. Every two hours new meteorological data comes in and all the pieces of land are recalculated.

The question is where to store the results of the calculation. I thought that since they are re-creatable, the cache would be the appropriate place. However, this differs from the more common use of the cache: the results are re-creatable, but they are also necessary. You can't just go and delete any item in the cache; doing so would cripple the website, which expects to find the calculation results there. Viewing something on the site will never trigger a recalculation (and if I do make it trigger one, it will be a safety procedure for edge cases and not the normal way of doing things). The results must also survive reboots, so I chose the file-based cache.
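For reference, a minimal sketch of what such a setup might look like in settings.py. The directory path is hypothetical, and setting TIMEOUT to None (so that entries never expire by age) is an assumption about the configuration, not something stated above.

```python
# settings.py (sketch; the LOCATION path is hypothetical)
CACHES = {
    "default": {
        # File-based backend: each entry is a file on disk, so entries survive reboots.
        "BACKEND": "django.core.cache.backends.filebased.FileBasedCache",
        "LOCATION": "/var/cache/irrigation",
        # Assumed: None means entries never expire by age.
        "TIMEOUT": None,
    }
}
```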

I didn't know about culling, so when the pieces of land grew to 100, and the items in the cache to 400 (4 items need to be stored for each piece of land), I spent a few hours trying to find out what the heck was going on. I solved the problem by tweaking the culling parameters. However, all this has raised a few issues:

  1. The filesystem cache can't grow too much because of issue 11260, which is marked wontfix. According to Russell Keith-Magee,
"the filesystem cache is intended as an easy way to test caching, not as a serious caching strategy. The default cache size and the cull strategy implemented by the file cache should make that obvious. If you need a cache capable of holding 100000 items, I strongly recommend you look at memcache. If you insist on using the filesystem as a cache, it isn't hard to subclass and extend the existing cache."
If these comments are correct, then the documentation needs some fixing, because not only does it not say that the filesystem cache is not for serious use, but it implies the opposite:
"Without a really compelling reason, ... you should stick to the cache backends included with Django. They've been well-tested and are easy to use."
Is Russell not entirely correct perhaps, or is the documentation? Or am I missing something?
  2. In the end, is it a bad idea to use the cache for this particular case? I also have a similar use case in an unrelated app: a page that needs about a minute to render. Although I've implemented a quick-and-dirty solution of increasing the web server's timeout and caching the page, I guess the correct way would be to produce that page offline with Celery or something similar. Where would I store such a page if not in the cache?
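
To make the culling problem described above concrete, here is a rough simulation, assuming the file-based backend works like this: before a write, if the number of entries has reached MAX_ENTRIES, a random 1/CULL_FREQUENCY of them is deleted. The defaults of 300 and 3 are Django's documented defaults; the function name and the simplified bookkeeping are illustrative only and are not Django's actual code.

```python
import random


def simulate_writes(num_items, max_entries=300, cull_frequency=3, seed=0):
    """Write num_items distinct keys and return the keys that survive culling."""
    rng = random.Random(seed)
    stored = set()
    for key in range(num_items):
        # Cull before the write once the cache already holds max_entries items.
        if len(stored) >= max_entries:
            if cull_frequency == 0:
                stored.clear()  # CULL_FREQUENCY = 0 empties the whole cache
            else:
                # Delete a random 1/cull_frequency of the current entries.
                victims = rng.sample(sorted(stored), len(stored) // cull_frequency)
                stored.difference_update(victims)
        stored.add(key)
    return stored


# 100 pieces of land x 4 items each, with the default parameters
survivors = simulate_writes(400)
print(len(survivors), "of 400 items remain")

# With MAX_ENTRIES above the number of items, nothing is culled
untouched = simulate_writes(400, max_entries=1000)
print(len(untouched), "of 400 items remain")
```

With the defaults, some of the 400 items are silently dropped, and which ones is random. Raising MAX_ENTRIES in the cache's OPTIONS above the expected number of items is one way of "tweaking the culling parameters" so that this does not happen.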
--   Antonis Christofides  http://djangodeployment.com
