Saturday, October 7, 2023

Re: Reference data while testing - interesting

On 8/10/2023 2:26 am, 'Ryan Nowakowski' via Django users wrote:
Do you need tests for every chemical? Is there a subset of tests and data that would cover all the cases?

No. Only chemicals which are agreed (by experts) to be correctly categorised are used to generate tests.

I consider these to be representative. They prove the software is correct for all chemicals which have comparable characteristics. Comparable in this context means having properties falling within the same ranges as the agreed correctly categorised chemical.
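A minimal sketch of what "comparable" means here, assuming each property is divided into regulator-defined bands. The band names and cut-off values below are invented purely for illustration; real ranges come from the relevant regulations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    name: str
    low: float   # inclusive lower bound
    high: float  # exclusive upper bound


# Hypothetical bands for a single numeric property (values are made up).
EXAMPLE_BANDS = [
    Band("band_1", float("-inf"), 10.0),
    Band("band_2", 10.0, 100.0),
    Band("band_3", 100.0, float("inf")),
]


def band_for(value, bands):
    """Return the name of the band whose range contains value."""
    for band in bands:
        if band.low <= value < band.high:
            return band.name
    raise ValueError(f"{value} falls outside every band")


def comparable(value_a, value_b, bands):
    """Two property values are comparable when they fall in the same band."""
    return band_for(value_a, bands) == band_for(value_b, bands)
```

Under these assumptions a chemical with a value of 30.0 is comparable to one with 55.0, but not to one with 9.9, however close the numbers look.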

We all know "similar" chemicals with even slightly different characteristics can have different degrees of hazard or even different hazards. But chasing those differences isn't the objective: regulators specify the ranges, and the ranges govern categorisation. There is always room for expert judgement and experts will defend their decisions.

However, particular chemicals have been assessed internationally (and/or locally) and listed with particular hazards and regulatory requirements. If the chemical being categorised by my system is one of those or happens to suddenly appear on one of those lists then action must be taken to ensure any specific precautions are adopted locally.

Part of the categorisation process involves checking all those lists.

Hence tests for the agreed correctly categorised chemicals must check those lists on every test run.
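As a rough sketch of that list check, assuming each imported list can be reduced to a set of identifiers: the list names and the "cas-1" style identifiers below are placeholders, not real lists or real CAS numbers.

```python
def find_listings(cas_number, regulatory_lists):
    """Return the sorted names of every list that contains cas_number."""
    return sorted(
        name
        for name, cas_numbers in regulatory_lists.items()
        if cas_number in cas_numbers
    )


# Hypothetical stand-in for the imported reference lists.
EXAMPLE_LISTS = {
    "list_a": {"cas-1", "cas-2"},
    "list_b": {"cas-2"},
    "list_c": set(),
}

# Listings an expert agreed on for the test chemical (placeholder value).
AGREED_LISTINGS = ["list_a", "list_b"]

# A generated test would assert something along these lines on every run.
assert find_listings("cas-2", EXAMPLE_LISTS) == AGREED_LISTINGS
```

If the chemical later appears on list_c, the assertion fails and someone has to look at it.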

As an aside, you probably already know that most industrial chemicals have a more-or-less unique CAS (Chemical Abstracts Service) number issued by the American Chemical Society (acs.org). Most regulators use the CAS number to identify most chemicals. Otherwise, proper chemical nomenclature is used for identification instead.
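Because the CAS number is the usual lookup key, a cheap sanity check before any list lookup can catch typos. CAS numbers end in a check digit: weight the other digits 1, 2, 3, ... from right to left, sum them, and take the result modulo 10. The function name below is my own; this is a sketch, not part of the system described.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas_number):
    """Return True when cas_number is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas_number)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Rightmost digit has weight 1, the next has weight 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit
```

For example, water is 7732-18-5: 8x1 + 1x2 + 2x3 + 3x4 + 7x5 + 7x6 = 105, and 105 mod 10 is 5, which matches the final digit.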

As it happens, we currently have fewer than a dozen ("representative") test chemicals so it would be possible to manually manage the fixtures - but that is so quill and ink I won't consider it.

Also, we have automated generation of new test chemicals. An expert user can independently trigger auto-creation of a test chemical based on the "correctly" categorised real chemical. The next time we auto-fetch all test chemicals from the production database and automatically regenerate the tests, it will be included in the test suite.

Speaking of quill and ink, few if any of those international regulatory lists have APIs. They offer Excel or HTML at best. Hence we import the lists and notified changes via Django migrations and other utilities we write.

As I mentioned below, I'd like read-only access to our reference tables during test runs. I guess I have to build an API using db_alias per ChatGPT's suggestion.

Thank you Ryan for responding. It helps to rethink stuff.

Cheers

Mike



On October 5, 2023 6:53:08 PM CDT, Mike Dewhirst <miked@dewhirst.com.au> wrote:
I have an interesting testing problem which requires someone smarter than me.

I want to prove correctness of multiple outcomes from multiple regulatory computations based on, among other things, international lists of assessed chemicals and their hazards.

Hazards include both human health and environmental - land, air and aquatic as well as the food chain from microbes on up.

Once a qualified chemistry expert has agreed that our (Django) software has correctly categorised a chemical according to local regulations and those lists, we need to lock in that behaviour which, as already indicated, is based on reference data.

To do so, we wrote a (Django) management command to auto-generate a set of tests for that chemical which proves all those outcomes are as expected whenever we run our tests. That protects against unintended side effects as we develop but not against changes in the reference data.

The interesting problem is that things change. Every day, new discoveries are made which indicate individual chemicals or chemical groups are actually more hazardous than previously thought.

Reference data fixtures are infeasible: with dozens of tables and hundreds of thousands of records they are too large to manage for frequent testing. Even if we did use fixtures, unless they were continuously refreshed the tests would just keep passing against stale data.

We need tests to fail when change happens so that our software can notify users that they have to rethink what they might be doing with a particular chemical or group of chemicals in order to avoid further damaging the planet or its inhabitants.
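One possible way to make a test notice such a change, sketched here as an assumption rather than something we do today: record a fingerprint of the reference rows relevant to a chemical when the expert signs off, then compare it with a fingerprint of the current rows on each test run. The row fields are placeholders.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable SHA-256 digest for a list of reference rows (dicts).

    Rows and keys are sorted first, so neither row order nor key order
    affects the result; only a change in content does.
    """
    canonical = json.dumps(
        sorted(rows, key=lambda row: json.dumps(row, sort_keys=True)),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder rows as they stood when the categorisation was agreed.
agreed_rows = [{"cas": "cas-1", "list": "list_a", "hazard": "h1"}]
AGREED_FINGERPRINT = fingerprint(agreed_rows)

# Same content, different key order: the fingerprint is unchanged.
current_rows = [{"hazard": "h1", "list": "list_a", "cas": "cas-1"}]
assert fingerprint(current_rows) == AGREED_FINGERPRINT
```

A fingerprint only says that something changed, not what, so a failing test would still need a person to work out whether the categorisation is affected.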

The base reference data is updated when things change. These are necessarily manual processes based on subscribing to emails to receive notifications from local and international regulators about upcoming changes, calls for public input and eventually publishing of actual adjustments and deadline dates. That is a continuous process.

I put this scenario to ChatGPT and prompted it down the db_alias path[1] of keeping the reference data portion of the database available read-only during testing so that tests are directly exposed to the source of truth. It came up with a complicated solution which, lacking confidence in it, I have not tried.

I cannot believe this is the first time a reference data source has been needed during Django testing so I'm asking here if anyone has a solution.

Many thanks

Cheers

Mike


- - - - - 
[1] Given the specific requirements of your testing scenario, where you need to access read-only reference data from a production database while emptying the regular testing database between tests, you can achieve this by using Django's db_alias feature. You can configure a separate database alias for read-only access to the reference data.

Here's how you can set it up:

  1. Configure a Separate Database for Reference Data:

    In your Django settings (settings.py), define a new database configuration for the reference data. You can specify this in the DATABASES setting using a different alias:

        DATABASES = {
            'default': {
                # Your regular database configuration for testing
                'ENGINE': 'django.db.backends.postgresql',
                'NAME': 'test_db',
                # Other settings...
            },
            'reference_db': {
                # Configuration for the reference data database (read-only)
                'ENGINE': 'django.db.backends.postgresql',
                'NAME': 'reference_db',
                # Other settings...
            },
        }

    Replace 'reference_db' with an appropriate alias for your reference data.

  2. Create a Router for Reference Data:

    Next, you can create a database router to determine which database to use based on the model being accessed. In your Django project, create a Python module (e.g., routers.py) and define a database router class:

        class ReferenceDataRouter:
            def db_for_read(self, model, **hints):
                # Use the 'reference_db' alias for models that access reference data
                if model._meta.app_label == 'your_app_label':
                    return 'reference_db'
                return None

            def db_for_write(self, model, **hints):
                # Never route writes to 'reference_db'; returning None lets
                # Django fall back to the 'default' database
                if model._meta.app_label == 'your_app_label':
                    return None
                return 'default'

            def allow_migrate(self, db, app_label, model_name=None, **hints):
                # Allow migrations only on the default database
                return db == 'default'

    Replace 'your_app_label' with the actual app label containing your reference data models.

  3. Add the Router to Your Settings:

    In your Django settings (settings.py), add the reference data router to the DATABASE_ROUTERS setting:

        DATABASE_ROUTERS = ['your_project.routers.ReferenceDataRouter']

  4. Write Tests That Access Reference Data:

    In your test cases, you can access the reference data by calling the using() method to specify the database alias:

        from django.test import TestCase

        from your_app.models import YourReferenceModel


        class YourTestCase(TestCase):
            # Django 2.2+ only permits queries on the aliases listed here.
            # Note that the test runner also sets up a test database for
            # every alias listed.
            databases = {'default', 'reference_db'}

            def test_access_reference_data(self):
                reference_objects = YourReferenceModel.objects.using('reference_db').all()
                # Perform assertions or test your application's behavior
                # using the reference data

    With this setup, you can read from the 'reference_db' database for models that access reference data, and writes will still occur in the regular 'default' database. This allows you to access the read-only reference data while maintaining the isolation and reset of the 'default' database between tests, as per Django's default testing behavior.
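    A stricter variation, sketched here as an assumption rather than anything tested against a real project, is to make routed writes to reference models fail loudly instead of quietly landing in 'default'. The exception and class names are hypothetical, and 'your_app_label' is the same placeholder used above. An explicit using() call is not routed, so this is a guard rail rather than a guarantee.

```python
REFERENCE_APP_LABELS = {"your_app_label"}  # placeholder app label


class ReadOnlyReferenceError(RuntimeError):
    """Raised when code tries to write a reference data model."""


class StrictReferenceDataRouter:
    def db_for_read(self, model, **hints):
        # Reference models are read from the reference database.
        if model._meta.app_label in REFERENCE_APP_LABELS:
            return "reference_db"
        return None  # no opinion: Django falls back to 'default'

    def db_for_write(self, model, **hints):
        # Refuse routed writes to reference models outright.
        if model._meta.app_label in REFERENCE_APP_LABELS:
            raise ReadOnlyReferenceError(
                f"{model._meta.app_label} models are read-only"
            )
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # As in the router above, migrate only the default database.
        return db == "default"
```

    The router is a plain class, so its decisions can be checked without a database by passing in stand-in objects that carry a _meta.app_label attribute.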

    To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/0FA932EA-13A3-465A-AE92-F073AEC51F41%40fattuba.com.


