Monday, February 1, 2016

Re: Kind'a TL, but please DR - Need your thoughts



On Sun, Jan 31, 2016 at 8:51 PM, Mario R. Osorio <nimbiotics@gmail.com> wrote:

I need comments on an application that has recently been proposed to me. The way it is envisioned at the moment is this:


One Python daemon will be listening on different communication media such as email, web and SMS (also via web). IMHO, it is necessary to have a daemon for each medium. These daemons will only make sure the messages are received from a validated source and put such messages in a DB.



So this is effectively a feed-aggregation engine. I would recommend having a separate daemon running per media source, so that issues with one media source do not affect the operation of another. It would be possible to do everything with one daemon, but it would be much trickier to implement.
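
To make that concrete, here is a hedged sketch of what one such per-media daemon could look like — an email listener polling IMAP and inserting validated messages. The server, credentials, table layout, and sender whitelist are all illustrative assumptions, not details from your post:

    # Hypothetical email-listener daemon: poll IMAP, keep only messages
    # from validated senders, and hand them to the DB for later processing.
    import time
    import imaplib
    import email
    from email.utils import parseaddr

    import psycopg2  # assuming PostgreSQL, per the rest of the thread

    ALLOWED_SENDERS = {"trusted@example.com"}  # the "validated source" check

    def store(conn, sender, body):
        # One row per inbound message; 'status' lets the processing daemon
        # find work later (see the claim-then-process sketch further down).
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO messages (medium, sender, body, status) "
                "VALUES ('email', %s, %s, 'new')",
                (sender, body),
            )

    def poll_once(conn):
        imap = imaplib.IMAP4_SSL("imap.example.com")
        imap.login("user", "password")
        imap.select("INBOX")
        _, data = imap.search(None, "UNSEEN")
        for num in data[0].split():
            _, msg_data = imap.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            sender = parseaddr(msg.get("From", ""))[1]
            if sender in ALLOWED_SENDERS:
                store(conn, sender, msg.get_payload())  # multipart handling omitted
        imap.logout()

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=app")  # DSN is illustrative
        while True:  # one loop like this per medium, each in its own daemon
            poll_once(conn)
            time.sleep(30)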
 

A second(?) Python daemon would wait for those messages to appear in the DB, process them, act according to the objective of the application, and update the DB as expected. This processing might include complicated and numerous mathematical calculations, which could take seconds or even minutes to complete.


Implementation here is less critical than your workflow design. This could be implemented as a simple cron script on the host that runs every few minutes. The trick is to determine whether a) records have already been processed, b) certain records are currently being processed, or c) records are available that have yet to be processed/examined. You can add extra DB columns to flag whether a process has already started examining a row, so any subsequent calls looking for new data can ignore those rows, even if the data hasn't finished processing.
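
For illustration, a minimal sketch of that claim-then-process pattern with the Django ORM. The Message model and its status values are hypothetical, invented for this example:

    # Claim a batch of unseen rows atomically, so overlapping cron runs
    # (or multiple workers) never grab the same batch.
    from django.db import transaction

    from myapp.models import Message  # hypothetical: status in {new, processing, done, error}

    def claim_batch(limit=50):
        """Flag a batch of unprocessed rows so other workers skip them."""
        with transaction.atomic():
            # Row locks from select_for_update() hold until the
            # transaction ends, serializing concurrent claimers.
            ids = list(
                Message.objects.select_for_update()
                .filter(status="new")
                .values_list("id", flat=True)[:limit]
            )
            Message.objects.filter(id__in=ids).update(status="processing")
        return Message.objects.filter(id__in=ids)

    def run_once():
        """Body of the cron job: claim, crunch, mark done."""
        for msg in claim_batch():
            try:
                do_the_math(msg)         # stand-in for the slow calculations
                msg.status = "done"
            except Exception:
                msg.status = "error"     # keep the row around for retry/inspection
            msg.save(update_fields=["status"])

    def do_the_math(msg):
        pass  # placeholder for the "seconds to minutes" processing step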

If you want something with a bit more control than cron, I would recommend a Celery instance that can be controlled/inspected by a Django installation.


A third(?) Python daemon would be in charge of replying to the original message with the obtained results, but there might be other media channels involved, e.g. the message was received from a given email or SMS user, but the results have to be sent to multiple other email/SMS users.


This is the same as the previous question, just with a different task. 
 


The reason I want to build the application with Django is that all of this HAS to have multiple web interfaces and, at the end of the day, most media will come through the web and have to be processed as HTTP requests. Also, Django gives me a framework that keeps the work better organized and clean, and lets me make the application(s) DB agnostic.


What do you mean by 'multiple web interfaces'? You mean multiple daemons running on different listening ports? Different sites using the sites framework? End-user browser vs. API? 

If most of your data arrives via HTTP calls, then a single HTTP instance should handle that fine. Your views can shovel data wherever it needs to go.
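
As a sketch of that "shoveling" — a view that accepts an inbound message over HTTP and drops it into the same hypothetical messages table; the token check is just a stand-in for whatever source validation you actually need:

    from django.http import HttpResponse, HttpResponseForbidden
    from django.views.decorators.csrf import csrf_exempt
    from django.views.decorators.http import require_POST

    from myapp.models import Message  # same hypothetical model as above

    @csrf_exempt          # external gateways won't carry a CSRF token
    @require_POST
    def inbound(request):
        if request.POST.get("token", "") != "shared-secret":  # illustrative check
            return HttpResponseForbidden()
        Message.objects.create(
            medium=request.POST.get("medium", "web"),
            sender=request.POST.get("sender", ""),
            body=request.POST.get("body", ""),
            status="new",
        )
        return HttpResponse(status=202)  # accepted; processing happens later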
 
As far as being DB agnostic, I'm assuming you mean the feed sources don't need to know which DB backend you are using? Django isn't really doing anything special there that any other framework can't do.


Wanting the application to be DB agnostic does not mean that I don't have a choice: I know I have many options to communicate among different Python processes, but I prefer to leave that to the DBMS. Of the open-source DBMSs I know of, only Firebird and PostgreSQL have events that can provide the communication between all the processes involved. I was able to create a very similar application in 2012 with Firebird, but this time I am restricted to PostgreSQL, which I don't oppose at all. That application did not involve HTTP requests.


Prefer to leave what to the DBMS? The DBMS is responsible for storing and indexing data, not process management. Some DBMSs have tricks to perform such tasks, but I wouldn't want to rely on them unless really necessary. If you're going to the trouble of writing separate listening daemons, then they can talk to whatever backend you choose with the right drivers.
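
For reference, the PostgreSQL "events" you mention are LISTEN/NOTIFY. A minimal psycopg2 sketch (channel name and DSN are illustrative) looks like this — workable as a wake-up signal, but I'd still treat it as that rather than as process management:

    import select

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=app")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute("LISTEN new_message;")  # another session runs: NOTIFY new_message, '42'

    while True:
        # Wait until the connection's socket is readable, then drain events.
        if select.select([conn], [], [], 60) == ([], [], []):
            continue                    # 60s timeout, nothing arrived
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print("event on", note.channel, "payload:", note.payload)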

Choose the database based on feature set, compatibility with your host systems, and perhaps even benchmarks based on the type of data you may be storing (short-lived high-read-rate data vs. long-term low-read-volume single-table data vs. sporadic read/write, etc.). Some databases handle certain situations better than others (e.g. if you are using UUIDs rather than integers for primary keys, PostgreSQL would likely be better than MySQL, since it has special optimization for indexing UUID fields).
 


My biggest concern at this point is this:

If most (if not all) requests to the application are going to be processed as HTTP requests, what will happen to pending requests when one of them takes too long to reply? Is this something to be solved at the application level or at the server level?


Hence my comments about workflow. You'll need to decide on the proper timers and what happens to that data. If by 'server level' you mean the operating system, the OS will only manage the raw connection details themselves (such as TCP timeouts); the data being processed is irrelevant to the OS. Your application needs to determine the behavior when a connection times out, or when a source takes too long to provide data (perhaps the host is keeping the TCP connection alive but not sending any data). That's up to you to decide. Your application should be aware of all of those scenarios and act according to your workflow design, which would include timers and default behavior for every action within the entire application.

You'll need to design the entire application so that you do not have data inconsistencies, and decide what to do when inconsistencies are encountered. The contents of the data will drive the requirements. Can you throw away an update from a feed if you only get one of the two pages you were expecting? Or do you keep the single page and somehow factor that into your other calculations? Will that ruin search results?
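
One common application-level answer to the "takes too long" question is to acknowledge the HTTP request immediately and let the client poll for the result, rather than holding the connection open. A minimal sketch, reusing the hypothetical Message model from above:

    from django.http import JsonResponse
    from django.views.decorators.csrf import csrf_exempt
    from django.views.decorators.http import require_POST

    from myapp.models import Message  # hypothetical, as above

    @csrf_exempt
    @require_POST
    def submit(request):
        msg = Message.objects.create(body=request.POST.get("body", ""), status="new")
        # Answer right away instead of holding the connection open for a
        # calculation that may take minutes; the daemons do the real work.
        return JsonResponse({"id": msg.id, "status": msg.status}, status=202)

    def job_status(request, msg_id):
        msg = Message.objects.get(id=msg_id)  # error handling omitted
        return JsonResponse({"id": msg.id, "status": msg.status})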


This is as simple as I can put it. Any thoughts, comments, criticism or recommendations are welcome.



It's difficult to put simply because you are not necessarily describing a simple system. I'm sure it will become even more complex as you get further through the design process, once you've gathered requirements. Requirements and workflow design will drive your implementation to meet the goals. For example, you would install Celery because you have a requirement to run a task four times a day normally, but would also like to trigger an on-demand run at any time. With cron, you can't (easily) do that.
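
To make that concrete, a minimal Celery sketch of "four scheduled runs a day, plus on-demand" — the broker URL and names are illustrative:

    from celery import Celery
    from celery.schedules import crontab

    app = Celery("pipeline", broker="redis://localhost:6379/0")

    @app.task(name="process_pending")
    def process_pending():
        pass  # scan the DB for unprocessed rows and handle them

    # Scheduled: celery beat fires this every six hours...
    app.conf.beat_schedule = {
        "process-every-6h": {
            "task": "process_pending",
            "schedule": crontab(minute=0, hour="*/6"),
        },
    }

    # ...and on demand, from a Django view or the shell:
    #   process_pending.delay()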

Figure out 'what' you want to do first, then figure out 'how'. It's easy to jump straight into implementation, but with a system as complex as the one you're describing, you should spend a bit of time with a flowchart tool, coming up with use cases, before anyone hits a keyboard.

-James

