Thursday, October 5, 2023

Reference data while testing - interesting

I have an interesting testing problem which requires someone smarter than me.

I want to prove correctness of multiple outcomes from multiple regulatory computations based on, among other things, international lists of assessed chemicals and their hazards.

Hazards include both human health and environmental - land, air and aquatic as well as the food chain from microbes on up.

Once a qualified chemistry expert has agreed that our (Django) software has correctly categorised a chemical according to local regulations and those lists, we need to lock in that behaviour which, as already indicated, is based on reference data.

To do so, we wrote a (Django) management command to auto-generate a set of tests for that chemical which proves all those outcomes are as expected whenever we run our tests. That protects against unintended side effects as we develop but not against changes in the reference data.

The interesting problem is that things change. Every day, new discoveries are made which indicate individual chemicals or chemical groups are actually more hazardous than previously thought.

Reference data fixtures are infeasible (dozens of tables, hundreds of thousands of records) because they are enormous and therefore unmanageable for frequent testing. Even if we did use fixtures, if not continuously refreshed, tests just keep passing.

We need tests to fail when change happens so that our software can notify users that they have to rethink what they might be doing with a particular chemical or group of chemicals in order to avoid further damaging the planet or its inhabitants.

The base reference data is updated when things change. These are necessarily manual processes based on subscribing to emails to receive notifications from local and international regulators about upcoming changes, calls for public input and eventually publishing of actual adjustments and deadline dates. That is a continuous process.

I put this scenario to ChatGPT and prompted it down the db_alias path[1] of keeping the reference data portion of the database available read-only during testing so that tests are directly exposed to the source of truth. It came up with a complicated/complex solution which, due to underconfidence, I have not tried.

I cannot believe this is the first time a reference data source has been needed during Django testing so I'm asking here if anyone has a solution.

Many thanks



- - - - - 
[1] Given the specific requirements of your testing scenario, where you need to access read-only reference data from a production database while emptying the regular testing database between tests, you can achieve this by using Django's db_alias feature. You can configure a separate database alias for read-only access to the reference data.

Here's how you can set it up:

  1. Configure a Separate Database for Reference Data:

    In your Django settings (, define a new database configuration for the reference data. You can specify this in the DATABASES setting using a different alias:

  • DATABASES = { 'default': { # Your regular database configuration for testing 'ENGINE': 'django.db.backends.postgresql', 'NAME': 'test_db', # Other settings... }, 'reference_db': { # Configuration for the reference data database (read-only) 'ENGINE': 'django.db.backends.postgresql', 'NAME': 'reference_db', # Other settings... }, }

    Replace 'reference_db' with an appropriate alias for your reference data.

  • Create a Router for Reference Data:

    Next, you can create a database router to determine which database to use based on the model being accessed. In your Django project, create a Python module (e.g., and define a database router class:

  • class ReferenceDataRouter: def db_for_read(self, model, **hints): # Use the 'reference_db' alias for models that access reference data if model._meta.app_label == 'your_app_label': return 'reference_db' return None def db_for_write(self, model, **hints): # Prevent writes to the 'reference_db' database if model._meta.app_label == 'your_app_label': return None return 'default' def allow_migrate(self, db, app_label, model_name=None, **hints): # Allow migrations only on the default database if db == 'default': return True return False

    Replace 'your_app_label' with the actual app label containing your reference data models.

  • Add the Router to Your Settings:

    In your Django settings (, add the reference data router to the DATABASE_ROUTERS setting:

  • DATABASE_ROUTERS = ['your_project.routers.ReferenceDataRouter']
  • Write Tests That Access Reference Data:

    In your test cases, you can access the reference data by using the using method to specify the database alias:

    1. from django.test import TestCase from your_app.models import YourReferenceModel class YourTestCase(TestCase): def test_access_reference_data(self): reference_objects = YourReferenceModel.objects.using('reference_db').all() # Perform assertions or test your application's behavior using the reference data

    With this setup, you can read from the 'reference_db' database for models that access reference data, and writes will still occur in the regular 'default' database. This allows you to access the read-only reference data while maintaining the isolation and reset of the 'default' database between tests, as per Django's default testing behavior.

    No comments:

    Post a Comment