Thursday, May 2, 2019

Re: architecture for Blue/Green deployments

Hi Dan,

I recently went through a similar exercise to the one you describe, to move our prototype code onto AWS.

First, my background includes a stint building a control plane for autoscaling VMs on OpenStack (and being generally long in the tooth), but this is my first attempt at a Web App, and therefore at Django too. I also grew up on VAXes, so the notion of an always-up cluster is deeply rooted.

Technical comments follow inline...

On Wed, 1 May 2019 at 21:35, <dansmood@gmail.com> wrote:
My organization is moving into the AWS cloud, and with some other projects using MongoDB, ElasticSearch and a web application framework that is not Django, we've had no problem.

I'm our "Systems/Applications Architect", and some years ago I helped choose Django over some other solutions.   I stand by that as correct for us, but our Cloud guys want to know how to do Blue/Green deployments, and the more I look at it the less happy I am.

Here's the problem:
  • Django's ORM has long shielded developers from simple SQL problems of the "SELECT * FROM fubar ..." and "INSERT INTO fubar VALUES (...)" sort.
  • However, if an existing "Blue" deployment knows about a column, it will try to retrieve it:
    • fubar = Fubar.objects.get(name='Samwise') 
  • If a new "Green" deployment is coming up, and we want to wait until Selenium testing has passed, we have the problem of migrations
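To make the failure mode concrete, here is a minimal sketch that uses Python's built-in sqlite3 rather than Django itself. It assumes a hypothetical `nickname` column on the `fubar` table, and mimics the way the ORM names each model field explicitly in its SELECT, so a "Blue" process breaks as soon as a "Green" migration removes a column it still expects.

```python
import sqlite3

# Hypothetical schema: "nickname" is an assumed column, used only for illustration.
# The Blue code names every column it knows about, as the ORM would.
BLUE_QUERY = "SELECT id, name, nickname FROM fubar WHERE name = ?"


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fubar (id INTEGER PRIMARY KEY, name TEXT, nickname TEXT)"
    )
    conn.execute("INSERT INTO fubar (name, nickname) VALUES ('Samwise', 'Sam')")
    return conn


def green_migration_drops_nickname(conn):
    # Rebuild the table without the column, which also works on SQLite
    # versions that lack ALTER TABLE ... DROP COLUMN.
    conn.execute("CREATE TABLE fubar_new (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO fubar_new (id, name) SELECT id, name FROM fubar")
    conn.execute("DROP TABLE fubar")
    conn.execute("ALTER TABLE fubar_new RENAME TO fubar")


def blue_lookup(conn, name):
    rows = conn.execute(BLUE_QUERY, (name,)).fetchall()
    return rows[0] if rows else None


conn = make_db()
before = blue_lookup(conn, "Samwise")   # works against the old schema
green_migration_drops_nickname(conn)
try:
    blue_lookup(conn, "Samwise")        # Blue still asks for "nickname"
    blue_broke = False
except sqlite3.OperationalError:
    blue_broke = True
```

The same applies in reverse: Green code that expects a new column fails against a database that has not been migrated yet.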
I really don't see any simple way around a new database cluster/instance when we bring up a new cluster, with something like this:
  • Mark the live database as "in maintenance mode". The application will now not write to the database, and we can also make the webapp user's access read-only to enforce this.
  • Take a snapshot
  • Restore the snapshot to build the new database instance/cluster.
  • Mark the new database as "live", e.g. clear "maintenance mode". If the webapp user is read-only, they are restored to full read/write permissions.
  • Run migrations in production
  • Bring up new auto-scaling group
We are not yet doing auto-scaling but otherwise your description applies very well to us. Right now, we have a pair of VMs, a "logic" VM hosting Django, and a "db" VM hosting Postgres (long term, we may move to Aurora for the database, but we are not there right now). The logic VM is based on an Ubuntu base image, plus a load of extra stuff:
  • Django, our code and all Python dependencies
  • A whole host of non-Python dependencies starting with RabbitMQ (needed for Celery), nginx, etc
  • And a whole lot of configuration for the above (starting with production keys, passwords and the like)
The net result is that not only does it take 10-15 minutes for AWS to spin up a new db VM from a snapshot, but it also takes several minutes to spin up, install, and configure the logic VM. So, we have a piece of code that can do a "live-to-<scenario>" upgrade:
  • Where scenario is "live-to-live" or "live-to-test".
  • The logic is the same in both except for a couple of small pieces only in the live-to-live case where we:
    • Pause the live system (db and celery queues) before snapshotting it for the new spin up
    • Create an archive of the database
    • Switch the Elastic IP on successful sanity test pass
  • We also have a small piece of run-time code in our project/settings.py that, on a live system, enables HTTPS and so on.
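A run-time switch of this kind in project/settings.py might look something like the sketch below. The `DEPLOY_ROLE` environment variable and the `security_settings` helper are hypothetical names chosen for illustration, and the HSTS value is arbitrary; the setting names themselves are standard Django security settings.

```python
import os


def security_settings(role):
    """Return Django security settings for the given deployment role.

    `role` is a hypothetical label ("live" or "test"); how a VM learns its
    role (environment variable, instance tag, ...) is an assumption here.
    """
    live = role == "live"
    return {
        "SECURE_SSL_REDIRECT": live,    # redirect HTTP to HTTPS on live only
        "SESSION_COOKIE_SECURE": live,  # session cookie sent over HTTPS only
        "CSRF_COOKIE_SECURE": live,     # CSRF cookie sent over HTTPS only
        "SECURE_HSTS_SECONDS": 3600 if live else 0,  # illustrative value
    }


# In settings.py the result would be merged into the module namespace, e.g.:
#   globals().update(security_settings(os.environ.get("DEPLOY_ROLE", "test")))
settings = security_settings(os.environ.get("DEPLOY_ROLE", "test"))
```

Defaulting to "test" means a VM that has not been told its role never behaves as live by accident.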
Before we do the "live-to-live" upgrade, we always do a "live-to-test" upgrade. This ensures we have run all migrations and pre-release sanities on virtually current data, and then perform a *separate* live-to-live.
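The shape of such a "live-to-<scenario>" driver can be sketched as follows. The step names are hypothetical labels rather than real commands, and the exact ordering of the live-only steps is an assumption; the sketch only shows how one step list can serve both scenarios, with a rehearsal on a test copy always preceding the real thing.

```python
def upgrade_steps(scenario):
    """Return the ordered steps for a "live-to-<scenario>" upgrade.

    Step names are illustrative placeholders, not real commands.
    """
    if scenario not in ("live-to-live", "live-to-test"):
        raise ValueError(f"unknown scenario: {scenario}")
    live = scenario == "live-to-live"

    steps = []
    if live:
        steps.append("pause live db and celery queues")
    steps.append("snapshot live db")
    if live:
        steps.append("archive database")
    steps += [
        "spin up db VM from snapshot",
        "spin up, install and configure logic VM",
        "run migrations",
        "run sanity tests",
    ]
    if live:
        # Only reached if the sanity tests passed.
        steps.append("switch Elastic IP to new logic VM")
    return steps


def release():
    """Rehearse on a test copy first, then do the separate live upgrade."""
    return upgrade_steps("live-to-test") + upgrade_steps("live-to-live")
```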

While this works, it creates a window during which the service must be down. There is also a finite window during which all those 3rd party dependencies on apt and pip/pypi expose the "live-to-live" to potential failure.

So in the "long term", I would prefer to attempt something like the following:
  • Use a cluster of N logic VMs.
  • Use an LB at the front end.
  • Enforce a development process that ensures that (roughly speaking) all database changes result in a new column, and where the old cannot be removed until a later update cycle. All migrations populate the new column.
  • We spin up an N+1th VM with the new logic and, once sanity testing has passed, switch the N+1th machine on in the LB and remove one of the original N.
    • Loop
  • Delete the old column
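The spin-up/switch/remove loop above can be sketched as a small simulation. The pool is just a list of VM labels, and `make_new_vm` and `sanity_ok` are hypothetical callables standing in for provisioning and for the sanity tests; nothing here talks to a real load balancer.

```python
import itertools


def rolling_replace(pool, make_new_vm, sanity_ok):
    """Replace every VM in `pool` one at a time, never dropping below N.

    Returns the pool size after each change so the invariant can be checked.
    """
    old = list(pool)
    sizes = [len(pool)]
    for victim in old:
        candidate = make_new_vm()   # the N+1th VM, not yet in the LB
        if not sanity_ok(candidate):
            raise RuntimeError(f"{candidate} failed sanity tests; rollout stopped")
        pool.append(candidate)      # switch the new VM on in the LB
        sizes.append(len(pool))
        pool.remove(victim)         # retire one of the original N
        sizes.append(len(pool))
    return sizes


# Usage with made-up VM names.
counter = itertools.count(1)
pool = ["old-1", "old-2", "old-3"]
sizes = rolling_replace(pool, lambda: f"new-{next(counter)}", lambda vm: True)
```

Because a failed sanity test stops the loop before the candidate joins the pool, a bad build leaves the remaining old VMs serving traffic.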
Of course, the $64k question is how to keep the old logic and the new logic in sync across the two columns. For that, I can only wave my arms at present and say that the old column cannot really be there in its bare form; instead there will be some kind of a view that makes it look like it is, possibly with some old school stored procedure/trigger logic in support. Of course, I would love it if there were some magic tooling developed by the Django and database gurus before I have to tackle this. Then again, I don't believe in magic. Nor do I believe we'll have an army of devs to fake the magic.
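As a toy illustration of the trigger idea, the sketch below keeps an old column and a new column in step while both generations of code are writing. It is not a view-based design, and it uses Python's built-in sqlite3 purely to stay self-contained; Postgres trigger syntax differs. The column names `name` (old) and `full_name` (new) are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE fubar (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT);

-- Whichever column the writer filled in, copy it to the other one.
CREATE TRIGGER fubar_ins AFTER INSERT ON fubar
BEGIN
    UPDATE fubar
       SET name = COALESCE(NEW.name, NEW.full_name),
           full_name = COALESCE(NEW.full_name, NEW.name)
     WHERE id = NEW.id;
END;

-- Old logic wrote `name`: mirror it into `full_name`.
CREATE TRIGGER fubar_old_to_new AFTER UPDATE OF name ON fubar
WHEN NEW.name IS NOT NEW.full_name
BEGIN
    UPDATE fubar SET full_name = NEW.name WHERE id = NEW.id;
END;

-- New logic wrote `full_name`: mirror it into `name`.
CREATE TRIGGER fubar_new_to_old AFTER UPDATE OF full_name ON fubar
WHEN NEW.full_name IS NOT NEW.name
BEGIN
    UPDATE fubar SET name = NEW.full_name WHERE id = NEW.id;
END;
"""


def read_row(conn):
    return conn.execute("SELECT name, full_name FROM fubar WHERE id = 1").fetchall()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Old logic only knows about `name`.
conn.execute("INSERT INTO fubar (name) VALUES ('Samwise')")
row_after_insert = read_row(conn)

# New logic only knows about `full_name`.
conn.execute("UPDATE fubar SET full_name = 'Samwise Gamgee' WHERE id = 1")
row_after_new = read_row(conn)

# Old logic writes again.
conn.execute("UPDATE fubar SET name = 'Sam' WHERE id = 1")
row_after_old = read_row(conn)
```

The WHEN clauses stop the two update triggers from bouncing writes back and forth. A real migration would also have to handle a writer that sets both columns to different values, and any case where the new column is not a straight copy of the old one.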

I'd love to be shown a better way (e.g. a complete second cluster, with a rolling migration of data from old to new until the old is killed?), else I'll be on the hook for making the above work!

Thanks, Shaheed
 
Of course, some things that Django does really help:
  • The database migration is first tested by the test process, which runs migrations
  • The unit tests have succeeded before we try to do this migration.

Does anyone have experience with, or a cloud operations design for, Blue/Green deployments with Django?

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/ae5310c6-b69f-43af-a838-5dce7bd6a712%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

