Django talk: Re: QueryDict and unicode

Wednesday, October 4, 2017

Re: QueryDict and unicode

On Oct 2, 2017 1:53 PM, "Alexey Lozickiy" <wrestlingmegafan@gmail.com> wrote:

Hi all,

Why does QueryDict handle the input query string differently on PY3 than on PY2? Here is the relevant part of QueryDict.__init__ from Django 1.11.5:

    if six.PY3:
        if isinstance(query_string, bytes):
            # query_string normally contains URL-encoded data, a subset of ASCII.
            try:
                query_string = query_string.decode(encoding)
            except UnicodeDecodeError:
                # ... but some user agents are misbehaving :-(
                query_string = query_string.decode('iso-8859-1')
        for key, value in limited_parse_qsl(query_string, **parse_qsl_kwargs):
            self.appendlist(key, value)
    else:
        for key, value in limited_parse_qsl(query_string, **parse_qsl_kwargs):
            try:
                value = value.decode(encoding)
            except UnicodeDecodeError:
                value = value.decode('iso-8859-1')
            self.appendlist(force_text(key, encoding, errors='replace'),
                            value)

First, on PY3 decoding is done only once, for the entire query string, while on PY2 the query is parsed first and then each value is decoded separately.
Second, on PY3 query_string is decoded only if it is of type bytes. Why is there no such check for PY2? Why not decode only if it's not already unicode?

I'm probably unqualified to answer this, but I'll try anyway.

The difference likely comes down to the change in string handling in Python 3. Py3 makes a distinction between character strings and byte strings.
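To illustrate the distinction, here is a minimal Python 3 sketch (not taken from Django; the sample query string and the helper name `mix` are made up for the example). It shows that bytes and str do not combine implicitly and that the decode step has to be explicit:

```python
# Python 3: text (str) and binary data (bytes) are separate types.
raw = b"name=J%C3%BCrgen"      # bytes, e.g. as received off the wire
text = raw.decode("ascii")      # explicit decode; URL-encoded data is ASCII


def mix(a, b):
    """Try to concatenate; return the exception type name on failure."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


# str + bytes is refused in Python 3 instead of being coerced silently.
result_mixed = mix(text, raw)
# str + str works as usual.
result_same = mix(text, "&lang=de")

print(result_mixed)
print(result_same)
```

On Python 2 the equivalent mix of unicode and str would have been coerced implicitly through the ASCII codec, which is the behavior Py3 removed.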

limited_parse_qsl() likely can/will only handle percent-encoded (URL-encoded) character strings on Py3, whereas on Py2 it handles byte strings transparently (byte strings that decode to percent-encoded text). I'm guessing that the implicit translation between bytes and characters no longer occurs in Py3 (for good and well-documented reasons), so the decoding has to be performed explicitly. You'll notice that the values are not decoded a second time on Py3.
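As a rough sketch of the Py3 branch, the decode-once-then-parse flow can be reproduced with the standard library alone. This assumes `urllib.parse.parse_qsl` as a stand-in for Django's `limited_parse_qsl`; the function names `decode_query_string` and `parse` are hypothetical and not part of Django:

```python
from urllib.parse import parse_qsl


def decode_query_string(query_string, encoding="utf-8"):
    """Decode bytes once, falling back to iso-8859-1 like the quoted code."""
    if isinstance(query_string, bytes):
        try:
            query_string = query_string.decode(encoding)
        except UnicodeDecodeError:
            # Raw non-UTF-8 bytes from a misbehaving user agent.
            query_string = query_string.decode("iso-8859-1")
    return query_string


def parse(query_string, encoding="utf-8"):
    """Decode the whole string first, then split it into (key, value) pairs."""
    text = decode_query_string(query_string, encoding)
    return parse_qsl(text, keep_blank_values=True, encoding=encoding)


# Properly percent-encoded input: plain ASCII bytes.
well_formed = parse(b"city=K%C3%B6ln")
# A raw Latin-1 byte in the query string triggers the fallback.
misbehaving = parse(b"city=K\xf6ln")

print(well_formed)
print(misbehaving)
```

Because the parser already returns str pairs here, there is nothing left to decode per value, which matches the observation that Py3 does not decode a second time.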

With this implementation it is not possible to pass a unicode object that contains non-ASCII characters to QueryDict.

Given the first comment in the code, if the data is not properly URL-encoded to begin with, I would expect the parsing of the values to explode. In other words, you can't pass a true Unicode string with characters beyond the ASCII range, because that isn't expected at this stage. To me, that's expected and desired behavior, since a QueryDict expects to be given a properly formatted, URL-encoded query.

The fix would be to URL-encode your true Unicode string prior to passing it to a QueryDict. That should allow support of Unicode characters with higher code points.
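A minimal sketch of that suggestion, using only the Python 3 standard library (the parameter name and sample value are made up for illustration): `urllib.parse.urlencode` percent-encodes the text as UTF-8, so the resulting query string is pure ASCII, and parsing it recovers the original characters.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical parameters containing non-ASCII characters.
params = {"q": "Łódź café"}

# urlencode() percent-encodes each value (UTF-8 by default) and
# turns spaces into '+', producing an ASCII-only query string.
query_string = urlencode(params)

# Parsing the encoded string gives back the original text.
round_tripped = parse_qsl(query_string)

print(query_string)
print(round_tripped)
```

In a configured Django project, passing `query_string` to `QueryDict` should behave the same way as `parse_qsl` does here, though that is an assumption I have not verified against 1.11.5.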

Basically, the Internet revolves around ASCII being the lowest common denominator.

Someone please correct me if I'm wrong.

-James

To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CA%2Be%2BciWQH_sQdoboTGtPmZ-28i9ihXguhOST662vR-eg%2B%2BYB9w%40mail.gmail.com.
