Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the form decoder with URLCodec #15

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

iku000888
Copy link

@iku000888 iku000888 commented Dec 7, 2016

  • This addresses the decoding part of the following issue
    Params middleware does not decode some strings correctly ring#269, which would fix my problem but does not address the other parts that are mentioned (encoder/custom percent encoder)

  • Would appreciate feedback whether I should address the other concerns, or this change itself could be merged in.

- This addresses the following URLdecoding issue
ring-clojure/ring#269
@iku000888
Copy link
Author

iku000888 commented Dec 7, 2016

Hm, looks like the codec library encodes/decodes utf-16 differently than the Java one.
(Long live testing!)
Perhaps this is a side effect to consider, and instead allow an optional encoder/decoder function as a input parameter. What do you think?

FAIL in (test-form-decode) (codec.clj:61)

expected: (= (form-decode "a=foo%FE%FF%00%2Fbar" "UTF-16") {"a" "foo/bar"})

  actual: (not (= {"�" "景濾＀⽢慲"} {"a" "foo/bar"})) ```

@weavejester
Copy link
Member

weavejester commented Dec 7, 2016

Yes, I noticed that as well when I tried to fix it, and it's not just UTF-16 that it encodes differently.

Java treats every non-percent-encoded character as literal, while Commons Codec treats them all as a series of bytes. So for instance, take the previous Shift-JIS encoded example you used:

"%83%82%83W%83o%83P%83R%83%8F%83C"

Notice that there's a "W", "o", "P", "R"and "C" in there. Commons Codec says "treat them as ASCII bytes" whereas Java says, "treat them as literal characters". Now, that makes me think Java is correct and Commons Codec is wrong, but you mentioned that browsers encode the same way as Commons Codec?

Is there a site I can check the Shift-JIS encoding at? Otherwise I guess I'll build a small handler with the right encoding.

@iku000888
Copy link
Author

iku000888 commented Dec 7, 2016

Found one!
http://d.hatena.ne.jp/keywordmobile/

With the example under discussion.
http://d.hatena.ne.jp/keywordsearchmobile?word=%83%82%83W%83o%83P%83R%83%8F%83C

(My perception tells me that browsers encode the Commons Codec way)
(Tested with Chrome on a Windows Machine)
(Let me know if you want more Japanese Text)

@iku000888
Copy link
Author

iku000888 commented Dec 7, 2016

And as far as the "correctness" of Java vs Codec goes...
I heard that the relevant RFC does not cover this,
which is why there is room for multi-byte string garbling to happen depending on the implementation

Reading the percent encoding section makes me think so too,
but on the other hand my literacy is limited...

https://tools.ietf.org/html/rfc3986#page-12

@zerg000000
Copy link

https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/third_party/WebKit/Source/platform/weborigin/KURL.cpp#L628
https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/url/url_util.cc#L646

the chromium encode flow is

  1. convert string to utf-8 string
  2. if ascii char append to output
  3. if non-ascii char convert to % hex hex and append to output

so Shift-JIS, UTF-16 encoding should not be an issue, since the string should be convert to utf-8 string before encode?

I guess the decode flow should be just reverse?

  1. assume the string is ascii string
  2. if not %, append to output
  3. if percent, capture the next two ascii char to form a utf-8 char and append to output
  4. convert output from utf-8 string to utf-16 string (java string)

@weavejester
Copy link
Member

so Shift-JIS, UTF-16 encoding should not be an issue, since the string should be convert to utf-8 string before encode?

But if that's the case we could just decode with UTF-8. The string "%83%82%83W%83o%83P%83R%83%8F%83C" represents a series of bytes encoded in Shift-JIS, not UTF-8.

user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
"モジバケコワイ"
user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "UTF-8")
"���W�o�P�R���C"

@iku000888
Copy link
Author

iku000888 commented Dec 16, 2016

@weavejester

I think adding support for a pluggable decoder is the best way to go for these reasons:

  • Can be done without changing the existing decode behaviour that could cause unexpected breakage somewhere on the planet with some exotic charset.
  • Can avoid the discussion of which implementation of the decoder/encoder is correct, by letting the user make the choice.

Here is what I have cooked up as an idea,
https://github.com/iku000888/alt-params-middleware/
which I am thinking of testing it in my production app to make sure it works as expected.

Would greatly appreciate if I can get your feedback on this idea.
If you like the idea, it would involve a patch to ring.util.codec and anothet to ring.middleware.params.

@weavejester
Copy link
Member

I don't think the encoder should be pluggable. We might as well use two libraries instead. If Chrome and other browsers decode in the same way as Commons Codec does, we should adopt that form.

@iku000888
Copy link
Author

@weavejester
Alright, then!
I have updated the failing utf-16 decoding test case data by conforming to how apache codec would encode "a" and "foo/bar". Am I correct that I should also go ahead and replace the encoder fn? Is there anything else to update?

@weavejester
Copy link
Member

The url-encode and url-decode functions need to be entirely rewritten as well.

@iku000888
Copy link
Author

@weavejester With the commons implementation?

@iku000888
Copy link
Author

@weavejester So I just went ahead and replaced the implementation of url-encode and url-decode using URL codec. Please let me know if this is not what was asked.

The tests are failing in travis, but run and pass just fine on my machine with jdk8. Would you know how to reconcile this assuming I implemented everything correctly?

@weavejester
Copy link
Member

With the commons implementation?

No, they need to be rewritten to handle encoding in the same way. We still need to differentiate between form encoding and URL encoding, so I don't think we can use Commons Codec.

In practice it means we'll probably need to set up a manual loop and recur, and percent-encode and url-encode will wind up looking very similar.

When I have time I'll take a look at this and write an updated implementation. If you want to you can have a bash at it yourself, but it'll be more complex than just substituting in the Commons Codec.

(is (= (url-encode "foo+bar") "foo+bar"))
(is (= (url-encode "foo bar") "foo%20bar")))
(is (= (url-encode "foo/bar" "UTF-16") "%FE%FF%00f%00o%00o%00%2F%00b%00a%00r"))
(is (= (url-encode "foo+bar") "foo%2Bbar"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. "foo+bar" should be encoded as "foo+bar", hence the test.

(is (= (url-encode "foo bar") "foo%20bar")))
(is (= (url-encode "foo/bar" "UTF-16") "%FE%FF%00f%00o%00o%00%2F%00b%00a%00r"))
(is (= (url-encode "foo+bar") "foo%2Bbar"))
(is (= (url-encode "foo bar") "foo+bar")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"foo bar" should be encoded as "foo%20bar".

@iku000888
Copy link
Author

Thanks for the feedback and it totally makes sense.
I might slowly tackle it over the next couple of weeks...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants