Saturday, December 16, 2017

Model Design Questions

I am building an electronic document management system using django and mysql. The purpose of this system is to catalog and retrieve electronic documents. There is no inventory of the physical documents. There are about 3,000 documents (photos, pdfs, videos) now, and that number could double in the next year. The purpose of the site is to search for a set of documents that share some given set of metadata values. I am wrestling with the model design for the metadata part of this project.

I have identified about 30 internal metadata types particular to these documents, and not related to any standards. If I use Dublin Core, there are 71 types, but I would not need all of them, and there is some overlap with the 30 internal metadata types - say another 25 types. Schema.org also has an interesting set of metadata tags, and some may be applicable to this project, and using that standard as well, could bring the number of types to 75.

As I see it, I have three options.

Option 1 - traditional database design, with metadata types as columns, and values in rows. The document model would have approx 50 - 80 fields, and one could expect out of 3,000 documents about 50-70% of the values are Null. 

This option is the most straight forward to create. I have no idea what the performance would be for django and mysql, searching for a set of documents in a model with 50-80 fields and 3,000 entries, where 60% are Null. Is this a feasible design, or do I need to think of another way to model this metadata? Will this scale to 6,000 documents? What if I need to add another metadata standard to the site in the future? (I am not looking for job security in this project)

Option 2 - some sort of EAV design, with a table that has document_id, metadatatype_id, and metadatavalue_id. I have read all sorts of posts about the evils of EAV, but they seem to have peaked about 8 years ago, and now I am seeing some posts that even recommend this design approach. 

Is this approach a better fit for my particular application? Will it scale to the numbers of documents and metadata I am thinking about? I really don't want to go to the dark side if I can help it. 

Option 3 - A hybrid solution that I have also read about. Put the more common metadata types as fields in the model (columns in the database table), and then JSON-ize the rest of the metadata for that document in a string field in the model. This seems to put a rather large burden on the application code when searching for a set of documents with a particular set of metadata values. 

The performance cost of this method seems to depend heavily on the distribution of metadata types in the entire document set, and being smart enough to pick the best common ones. If there were a set of metadata types that span 90% of the the documents, then this would probably be a good solution. Span only 25% of the documents, and there is a lot of comparing values in the JSON string. 

Thanks for any insights/recommendations you can provide!

Mark

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAEqej2MYmfqaA1Gq7nByjGo%3DcxF6_x7fbz4cuRy2EgrCBmuDVA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

No comments:

Post a Comment