Friday, December 30, 2016

Re: Django-Channels strange slowdowns - how to profile?

Wow, that's an incredibly slow network you have there. I would agree that you should move them all to the same place; I would say you want no more than 10ms latency (0.01s) to your Redis server in a case like this.

Andrew

On Thu, Dec 29, 2016 at 2:31 PM, <jakub.skalecki@pvpc.eu> wrote:
I've prepared this simple script to measure the roundtrip time (Redis on a remote host):

import time
from myproject.asgi import channel_layer  # import path depends on your project

def check_time(path):
    # Inject a fake http.request onto the channel layer and time the roundtrip.
    t = time.time()
    channel_layer.send(u"http.request", {"reply_channel": u"http.response.test", "path": path,
                                         "method": "GET", "headers": {'host': 'localhost'}})
    response = channel_layer.receive_many([u'http.response.test'])
    print str(time.time() - t) + '   ' + str(len(response[1]['content']) if response and response[1] else '')

Results are:
>>> check_time('/health')  # really simple http endpoint
0.140397071838
>>> check_time('/health')
0.132459163666   27
>>> check_time('/health')
0.158210992813   27
>>> check_time('/health')
0.132560014725   27
>>> check_time('/health')
0.133027076721   27
>>> check_time('/health')
0.141057014465   27
>>> check_time('/health')
0.141141891479   27

But when I make a request to the '/' endpoint, it takes more time (probably due to the larger data transfer):

>>> check_time('/')
0.191249132156   21797
>>> check_time('/')
0.192593812943   21797
>>> check_time('/')
0.18672990799   21797
>>> check_time('/')
0.186357021332   21797
>>> check_time('/')
0.177523851395   21797
>>> check_time('/')
0.190500974655   21797
>>> check_time('/')
0.193222045898   21797

I know that the received message is actually the response to the previous check_time call, but that shouldn't matter. It looks like the Redis roundtrip takes a lot of time. For comparison, the same requests against a local Redis instance:

>>> check_time('/')
0.000913858413696   21755
>>> check_time('/')
0.000931024551392   21755
>>> check_time('/')
0.000808000564575   21755
>>> check_time('/')
0.00170421600342   21755
>>> check_time('/')
0.00079607963562   21755
>>> check_time('/')
0.000899076461792   21755
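
To double-check that raw network latency, rather than channel-layer overhead, is the dominating cost, a bare Redis PING against the remote host could be timed as well; a minimal sketch using the redis-py client (the host name is a placeholder):

import time
import redis

# Placeholder connection details: point this at the same Redis instance
# the channel layer uses.
r = redis.StrictRedis(host='redis.example.com', port=6379)

for _ in range(5):
    t = time.time()
    r.ping()  # pure network + Redis roundtrip, no channel-layer work involved
    print(time.time() - t)

If the PING roundtrip lands in the same ~0.13s range, the channel layer itself is cleared and it really is just the network.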

So I think the best thing to do is move all the Docker containers to the same datacenter, because latency seems to be the problem... Do you agree?

On Thursday, December 29, 2016 at 1:51:34 PM UTC+1, jakub.s...@pvpc.eu wrote:
Thank you for your response. I'll do what you suggested regarding the basic consumer and post my results here.

Sadly, Docker Swarm and distributed hosts are much harder to debug than a "standard" deployment. Currently, the full path of a request looks like this:

External load balancer (5 hosts) -> Docker Swarm overlay network routing to nginx -> nginx routing to one of the Daphne servers (again over the Docker overlay network) -> Daphne puts the new http request onto the Redis queue for processing -> a worker takes the request, prepares the response, and sends it back to Redis -> Daphne takes the response and sends it to nginx -> nginx returns it to the user.

Each of these steps introduces some latency. Also, the response size is close to 23,000 bytes, and the time spent inside the Django view is about 0.02-0.03s. Not all hosts are in the same datacenter.

I've prepared this setup because of high traffic spikes on weekends, when users generate about 80-100 rps. My last load tests showed that we can currently handle ~200, but sometimes (quite rarely) something goes bad and there is a major slowdown. It may be Docker's fault; yesterday I found out that only one of our Daphne servers was being used while the others sat idle. That's fixed now. I also previously tried gunicorn alongside the ws workers, but quite surprisingly it was slower than the http workers.

Andrew, I'll try to do what you suggested and post my results here. Also, do you have any data on a good ratio of workers to Daphne servers, or of ws workers to http workers? Any general advice on tuning Redis / Daphne / the workers? Each of them can use one CPU core; that's how my Swarm is configured. Yesterday I found that one interface server can handle up to 150 simultaneous connections, some of them idle (I know that's specific to my case); higher values caused slowdowns. Not by much, but maybe that's my fault.

Oh, and if you would like to check our site, it's https://beta.pvpc.eu

On Thursday, December 29, 2016 at 11:53:12 AM UTC+1, Andrew Godwin wrote:
Hi Jakub,

There's no full profiling path yet - this is something I hope to work on as we start integrating ASGI code at work early next year.

400ms for a request is already very long; I'm slightly concerned by that (unless that's normal without Channels installed as well).

If you want to do a basic profiling test, you can just write a very basic consumer that sends something onto the reply channel and then:

1) Directly inject requests into the channel layer from a python script, listen for the response, and time the roundtrip. You can import `channel_layer` from your project's `asgi.py` - if you don't have one the docs talk about how to make one. You'll want to call channel_layer.send("http.request", {"reply_channel": "http.response!test", "path": ....}) and channel_layer.receive_many(["http.response!test"]).

2) Directly make HTTP requests to Daphne via localhost on the server running it, with a similar round-trip measurement. Something like `ab` works well for this; remember to either target a very simple view or compare it against a similar test using a WSGI server.
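
For step 1), a minimal sketch of the kind of basic consumer described above, assuming Channels 1.x (the module path and names are illustrative, not from the original thread):

# consumers.py -- replies to every http.request with a tiny canned body,
# so the measured roundtrip covers the channel layer and network, not view time.
def timing_consumer(message):
    message.reply_channel.send({
        'status': 200,
        'content': b'ok',
    })

# routing.py -- hook the consumer up to the http.request channel.
from channels.routing import route
from myapp.consumers import timing_consumer  # "myapp" is a placeholder

channel_routing = [
    route('http.request', timing_consumer),
]

With this in place, any difference between the injected-request roundtrip and a plain Redis roundtrip is time spent in the worker itself.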

Andrew

On Wed, Dec 28, 2016 at 7:31 PM, <jakub.s...@pvpc.eu> wrote:

My question from StackOverflow:


My technology stack is Redis as the channels backend, PostgreSQL as the database, Daphne as the ASGI server, and Nginx in front of the whole application. Everything is deployed using Docker Swarm, with only Redis and the database outside. I have about 20 virtual hosts, with 20 interface servers, 40 http workers, and 20 websocket workers. Load balancing is done using the Ingress overlay Docker network. Oh, and I'm also using django-cacheops for low-timeout caching (up to 15 min).
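
For reference, a setup like this would typically use the asgi_redis backend; a sketch of the corresponding Channels 1.x setting (the host name and routing path are placeholders, not the actual configuration):

# settings.py -- Channels 1.x channel layer backed by Redis.
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'asgi_redis.RedisChannelLayer',
        'CONFIG': {
            'hosts': [('redis.example.com', 6379)],  # placeholder host
        },
        'ROUTING': 'myproject.routing.channel_routing',  # placeholder path
    },
}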


The problem is that sometimes very weird things happen performance-wise. Most requests are handled in under 400ms, but sometimes a request can take up to 2-3s, even under very light load. Profiling the workers with Django Debug Toolbar or middleware-based profilers shows nothing unusual (timings of 50-100ms or so).
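
Those middleware-based profilers only see the worker phase, which is why they show nothing; a minimal sketch of such a timing middleware (old-style Django middleware, matching this era) illustrates what that measurement actually covers:

# A bare-bones timing middleware: it measures only the time spent inside the
# worker (middleware + view), not the Daphne/Redis/worker hops around it.
import logging
import time

logger = logging.getLogger(__name__)

class RequestTimingMiddleware(object):
    def process_request(self, request):
        request._timing_start = time.time()

    def process_response(self, request, response):
        start = getattr(request, '_timing_start', None)
        if start is not None:
            logger.info('worker handled %s in %.3fs', request.path, time.time() - start)
        return response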


My question: is there any good method of profiling the whole request path with django-channels? I would like to know how much time each phase takes, i.e. when the request was processed by Daphne, when the worker started processing, when it finished, and when the interface server sent the response to the client. Currently, I have no idea how to approach this.


EDIT:

Some more information:

The website isn't that big; it tops out at 1k users online, with some real-time features like notifications, chat, and real-time updates of a few views. It sometimes looks like the website is "blocked", as if communication between Redis, the interface servers, and the workers were incredibly slow.


Maybe you have some experience with server/worker ratios, or with fine-tuning Channels and Redis settings?


