Sunday, December 17, 2017

Re: Model Design Questions



On Dec 16, 2017 4:46 PM, "Mark Phillips" <mark@phillipsmarketing.biz> wrote:
I am building an electronic document management system using Django and MySQL. The purpose of this system is to catalog and retrieve electronic documents. There is no inventory of the physical documents. There are about 3,000 documents (photos, pdfs, videos) now, and that number could double in the next year. The purpose of the site is to search for a set of documents that share some given set of metadata values. I am wrestling with the model design for the metadata part of this project.

I have identified about 30 internal metadata types particular to these documents, not related to any standards. If I use Dublin Core, there are 71 types, but I would not need all of them, and there is some overlap with the 30 internal metadata types - say another 25 types. Schema.org also has an interesting set of metadata tags, some of which may be applicable to this project, and using that standard as well could bring the number of types to 75.

While I'm not familiar with the metadata standards you mentioned, if there is any chance you'll need that standardization for interoperability with external systems and/or standardized reporting, I'd recommend going with the one that best fits your data. Adding extra columns would give you the flexibility to cover attributes that don't fit a particular standard.


As I see it, I have three options.

Option 1 - traditional database design, with metadata types as columns, and values in rows. The document model would have approx 50-80 fields, and across the 3,000 documents one could expect about 50-70% of the values to be Null.

This option is the most straightforward to create. I have no idea what the performance would be for Django and MySQL, searching for a set of documents in a model with 50-80 fields and 3,000 entries, where 60% of the values are Null. Is this a feasible design, or do I need to think of another way to model this metadata? Will this scale to 6,000 documents? What if I need to add another metadata standard to the site in the future? (I am not looking for job security in this project)

With the numbers you are talking about here, you'll see zero performance hit. Circle back around when you get close to hundreds of columns and/or millions of rows.

Just be sure that you have heavy indexing enabled on all columns that will be commonly used in search queries. This sounds like a high-volume read and low-volume write database, which is the best kind. At the size you're talking about, you can probably have the entire thing cached in memory and make it super fast.
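
For concreteness, a wide Option 1 model might look something like this (a minimal sketch only - the field names are placeholders, not your actual metadata types):

from django.db import models

class Document(models.Model):
    # One wide table: every metadata type gets its own nullable column,
    # with db_index=True on the ones you expect to filter by most often.
    file = models.FileField(upload_to='documents/')
    title = models.CharField(max_length=255, db_index=True)
    creator = models.CharField(max_length=255, null=True, blank=True, db_index=True)
    date_created = models.DateField(null=True, blank=True, db_index=True)
    subject = models.CharField(max_length=255, null=True, blank=True, db_index=True)
    # ... and so on for the remaining 50-80 metadata types

# Searches are then plain filters against indexed columns, e.g.:
# Document.objects.filter(creator='Smith', date_created__year=1923)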



Option 2 - some sort of EAV (entity-attribute-value) design, with a table that has document_id, metadatatype_id, and metadatavalue_id. I have read all sorts of posts about the evils of EAV, but they seem to have peaked about 8 years ago, and now I am seeing some posts that even recommend this design approach.

Is this approach a better fit for my particular application? Will it scale to the numbers of documents and metadata I am thinking about? I really don't want to go to the dark side if I can help it. 

I have no experience with EAV design, but from a quick read, your scale concern runs in the opposite direction from what you think: I don't think you have enough data to justify introducing the complexity of EAV, let alone see any benefit from it. EAV is aimed at highly optimized Big Data databases with millions or billions of rows.
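
If it helps to see it, a minimal EAV sketch in Django terms would be three models along these lines (model and field names are hypothetical):

from django.db import models

class MetadataType(models.Model):
    # e.g. "creator", "subject", "dc:format"
    name = models.CharField(max_length=100, unique=True)

class Document(models.Model):
    file = models.FileField(upload_to='documents/')

class MetadataValue(models.Model):
    document = models.ForeignKey(Document, related_name='metadata', on_delete=models.CASCADE)
    type = models.ForeignKey(MetadataType, on_delete=models.PROTECT)
    value = models.CharField(max_length=255, db_index=True)

# Each extra search condition becomes another join, e.g.
# "creator is Smith AND subject is mining":
# Document.objects.filter(metadata__type__name='creator', metadata__value='Smith') \
#                 .filter(metadata__type__name='subject', metadata__value='mining')

Those join-per-condition queries are where most of the usual EAV complaints come from.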


Option 3 - A hybrid solution that I have also read about. Put the more common metadata types as fields in the model (columns in the database table), and then JSON-ize the rest of the metadata for that document in a string field in the model. This seems to put a rather large burden on the application code when searching for a set of documents with a particular set of metadata values. 

The performance cost of this method seems to depend heavily on the distribution of metadata types across the entire document set, and on being smart enough to pick the best common ones. If there were a set of metadata types that span 90% of the documents, then this would probably be a good solution. If they span only 25% of the documents, there is a lot of comparing values inside the JSON string.

This one has potential, but it would need to be a judgement call on your part. If you are hitting the JSON attributes for more than 25% (or whatever threshold you choose) of your standard searches, then you'll need to add more columns.
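
As a rough sketch of the hybrid, with the JSON kept in a plain string column and the filtering done in application code (the burden you describe - field and function names below are placeholders):

import json
from django.db import models

class Document(models.Model):
    file = models.FileField(upload_to='documents/')
    # the common metadata types as real, indexed columns
    title = models.CharField(max_length=255, db_index=True)
    creator = models.CharField(max_length=255, null=True, blank=True, db_index=True)
    # everything else serialized into one string field
    extra_metadata = models.TextField(default='{}')

def filter_extra(queryset, **wanted):
    # Narrow by the indexed columns first, then compare the rare metadata
    # in Python. Workable at a few thousand rows, painful much beyond that.
    for doc in queryset:
        extra = json.loads(doc.extra_metadata or '{}')
        if all(extra.get(k) == v for k, v in wanted.items()):
            yield doc

# e.g. list(filter_extra(Document.objects.filter(creator='Smith'), medium='glass negative'))

A native JSON column (Postgres, or MySQL 5.7+ via a third-party field) would let the database do some of that comparison for you, but the trade-off described above stays the same.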


Thanks for any insights/recommendations you can provide!

Have you thought of using a document store database like MongoDB? Your use case would appear to be a perfect fit for it. Of course you wouldn't have the advantage of the Django ORM, but I wouldn't let that keep you from using the right tool. There should be swaths of sites using it for exactly this use case. To me, trying to crowbar this into an RDBMS (even Postgres with HSTORE support) seems rigid, and would force you to customize every single piece of the search process. Other search tools like Elasticsearch or Haystack may also help. When dealing with the actual data, don't let Django be the limiting factor. You can still use MySQL/Postgres for the non-document data like user accounts and session management.
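
A rough idea of what that looks like with pymongo (database, collection, and field names are made up for the example):

from pymongo import MongoClient, ASCENDING

client = MongoClient('mongodb://localhost:27017/')
docs = client['edms']['documents']

# Each document carries only the metadata it actually has - no NULL columns.
docs.insert_one({
    'file_path': 'photos/img_0001.jpg',
    'title': 'Main Street, 1923',
    'creator': 'Smith',
    'subject': ['mining', 'streetscape'],
})

# Index the metadata fields you search on most often.
docs.create_index([('creator', ASCENDING), ('subject', ASCENDING)])

# Find the set of documents sharing some metadata values.
results = docs.find({'creator': 'Smith', 'subject': 'mining'})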

-James

