Replace the form decoder with URLCodec #15

iku000888 · 2016-12-07T05:07:05Z

This addresses the decoding part of the following issue
Params middleware does not decode some strings correctly ring#269, which would fix my problem but does not address the other parts that are mentioned (encoder/custom percent encoder)
Would appreciate feedback whether I should address the other concerns, or this change itself could be merged in.

- This addresses the following URLdecoding issue ring-clojure/ring#269

iku000888 · 2016-12-07T08:04:44Z

Ｈｍ, looks like the codec library encodes/decodes utf-16 differently than the Java one.
(Long live testing!)
Perhaps this is a side effect to consider, and instead allow an optional encoder/decoder function as a input parameter. What do you think?

FAIL in (test-form-decode) (codec.clj:61)

expected: (= (form-decode "a=foo%FE%FF%00%2Fbar" "UTF-16") {"a" "foo/bar"})

  actual: (not (= {"�" "景濾＀⽢慲"} {"a" "foo/bar"})) ```

weavejester · 2016-12-07T11:19:19Z

Yes, I noticed that as well when I tried to fix it, and it's not just UTF-16 that it encodes differently.

Java treats every non-percent-encoded character as literal, while Commons Codec treats them all as a series of bytes. So for instance, take the previous Shift-JIS encoded example you used:

"%83%82%83W%83o%83P%83R%83%8F%83C"

Notice that there's a "W", "o", "P", "R"and "C" in there. Commons Codec says "treat them as ASCII bytes" whereas Java says, "treat them as literal characters". Now, that makes me think Java is correct and Commons Codec is wrong, but you mentioned that browsers encode the same way as Commons Codec?

Is there a site I can check the Shift-JIS encoding at? Otherwise I guess I'll build a small handler with the right encoding.

iku000888 · 2016-12-07T15:34:15Z

Found one!
http://d.hatena.ne.jp/keywordmobile/

With the example under discussion.
http://d.hatena.ne.jp/keywordsearchmobile?word=%83%82%83W%83o%83P%83R%83%8F%83C

(My perception tells me that browsers encode the Commons Codec way)
(Tested with Chrome on a Windows Machine)
(Let me know if you want more Japanese Text)

iku000888 · 2016-12-07T15:44:54Z

And as far as the "correctness" of Java vs Codec goes...
I heard that the relevant RFC does not cover this,
which is why there is room for multi-byte string garbling to happen depending on the implementation

Reading the percent encoding section makes me think so too,
but on the other hand my literacy is limited...

https://tools.ietf.org/html/rfc3986#page-12

zerg000000 · 2016-12-13T11:39:41Z

https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/third_party/WebKit/Source/platform/weborigin/KURL.cpp#L628
https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/url/url_util.cc#L646

the chromium encode flow is

convert string to utf-8 string
if ascii char append to output
if non-ascii char convert to % hex hex and append to output

so Shift-JIS, UTF-16 encoding should not be an issue, since the string should be convert to utf-8 string before encode?

I guess the decode flow should be just reverse?

assume the string is ascii string
if not %, append to output
if percent, capture the next two ascii char to form a utf-8 char and append to output
convert output from utf-8 string to utf-16 string (java string)

weavejester · 2016-12-13T13:23:12Z

so Shift-JIS, UTF-16 encoding should not be an issue, since the string should be convert to utf-8 string before encode?

But if that's the case we could just decode with UTF-8. The string "%83%82%83W%83o%83P%83R%83%8F%83C" represents a series of bytes encoded in Shift-JIS, not UTF-8.

user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
"モジバケコワイ"
user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "UTF-8")
"���W�o�P�R���C"

iku000888 · 2016-12-16T18:06:49Z

@weavejester

I think adding support for a pluggable decoder is the best way to go for these reasons:

Can be done without changing the existing decode behaviour that could cause unexpected breakage somewhere on the planet with some exotic charset.
Can avoid the discussion of which implementation of the decoder/encoder is correct, by letting the user make the choice.

Here is what I have cooked up as an idea,
https://github.com/iku000888/alt-params-middleware/
which I am thinking of testing it in my production app to make sure it works as expected.

Would greatly appreciate if I can get your feedback on this idea.
If you like the idea, it would involve a patch to ring.util.codec and anothet to ring.middleware.params.

weavejester · 2016-12-16T18:39:53Z

I don't think the encoder should be pluggable. We might as well use two libraries instead. If Chrome and other browsers decode in the same way as Commons Codec does, we should adopt that form.

iku000888 · 2016-12-17T15:59:51Z

@weavejester
Alright, then!
I have updated the failing utf-16 decoding test case data by conforming to how apache codec would encode "a" and "foo/bar". Am I correct that I should also go ahead and replace the encoder fn? Is there anything else to update?

weavejester · 2016-12-17T16:53:36Z

The url-encode and url-decode functions need to be entirely rewritten as well.

iku000888 · 2016-12-17T17:01:49Z

@weavejester With the commons implementation?

- test data has been updated accordingly

iku000888 · 2016-12-17T17:44:33Z

@weavejester So I just went ahead and replaced the implementation of url-encode and url-decode using URL codec. Please let me know if this is not what was asked.

The tests are failing in travis, but run and pass just fine on my machine with jdk8. Would you know how to reconcile this assuming I implemented everything correctly?

weavejester · 2016-12-17T18:08:19Z

With the commons implementation?

No, they need to be rewritten to handle encoding in the same way. We still need to differentiate between form encoding and URL encoding, so I don't think we can use Commons Codec.

In practice it means we'll probably need to set up a manual loop and recur, and percent-encode and url-encode will wind up looking very similar.

When I have time I'll take a look at this and write an updated implementation. If you want to you can have a bash at it yourself, but it'll be more complex than just substituting in the Commons Codec.

weavejester · 2016-12-17T18:10:39Z

test/ring/util/test/codec.clj

- (is (= (url-encode "foo+bar") "foo+bar"))
- (is (= (url-encode "foo bar") "foo%20bar")))
+ (is (= (url-encode "foo/bar" "UTF-16") "%FE%FF%00f%00o%00o%00%2F%00b%00a%00r"))
+ (is (= (url-encode "foo+bar") "foo%2Bbar"))


This is incorrect. "foo+bar" should be encoded as "foo+bar", hence the test.

weavejester · 2016-12-17T18:10:58Z

test/ring/util/test/codec.clj

- (is (= (url-encode "foo bar") "foo%20bar")))
+ (is (= (url-encode "foo/bar" "UTF-16") "%FE%FF%00f%00o%00o%00%2F%00b%00a%00r"))
+ (is (= (url-encode "foo+bar") "foo%2Bbar"))
+ (is (= (url-encode "foo bar") "foo+bar")))


"foo bar" should be encoded as "foo%20bar".

iku000888 · 2016-12-17T19:02:21Z

Thanks for the feedback and it totally makes sense.
I might slowly tackle it over the next couple of weeks...

Replace the form decoder with URLCodec

8794399

- This addresses the following URLdecoding issue ring-clojure/ring#269

iku000888 force-pushed the master branch from 6b6f581 to 8794399 Compare December 7, 2016 05:08

iku000888 mentioned this pull request Dec 7, 2016

Params middleware does not decode some strings correctly ring-clojure/ring#269

Open

Update the failing test to comply with how commons codec would encode

51b52d7

iku000888 added 2 commits December 18, 2016 02:26

Replace the implementation of url-decode and url-encode

ad55f5f

- test data has been updated accordingly

Re-run travis

9584410

iku000888 added 2 commits December 18, 2016 02:46

Remove dev profile in an attempt to fix travis test

1785df0

Do not test with default profile

614d64b

weavejester requested changes Dec 17, 2016

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace the form decoder with URLCodec #15

Replace the form decoder with URLCodec #15

iku000888 commented Dec 7, 2016 •

edited

iku000888 commented Dec 7, 2016 •

edited

weavejester commented Dec 7, 2016 •

edited

iku000888 commented Dec 7, 2016 •

edited

iku000888 commented Dec 7, 2016 •

edited

zerg000000 commented Dec 13, 2016

weavejester commented Dec 13, 2016

iku000888 commented Dec 16, 2016 •

edited

weavejester commented Dec 16, 2016

iku000888 commented Dec 17, 2016

weavejester commented Dec 17, 2016

iku000888 commented Dec 17, 2016

iku000888 commented Dec 17, 2016

weavejester commented Dec 17, 2016

weavejester Dec 17, 2016

weavejester Dec 17, 2016

iku000888 commented Dec 17, 2016

Replace the form decoder with URLCodec #15

Are you sure you want to change the base?

Replace the form decoder with URLCodec #15

Conversation

iku000888 commented Dec 7, 2016 • edited

iku000888 commented Dec 7, 2016 • edited

weavejester commented Dec 7, 2016 • edited

iku000888 commented Dec 7, 2016 • edited

iku000888 commented Dec 7, 2016 • edited

zerg000000 commented Dec 13, 2016

weavejester commented Dec 13, 2016

iku000888 commented Dec 16, 2016 • edited

weavejester commented Dec 16, 2016

iku000888 commented Dec 17, 2016

weavejester commented Dec 17, 2016

iku000888 commented Dec 17, 2016

iku000888 commented Dec 17, 2016

weavejester commented Dec 17, 2016

weavejester Dec 17, 2016

Choose a reason for hiding this comment

weavejester Dec 17, 2016

Choose a reason for hiding this comment

iku000888 commented Dec 17, 2016

iku000888 commented Dec 7, 2016 •

edited

iku000888 commented Dec 7, 2016 •

edited

weavejester commented Dec 7, 2016 •

edited

iku000888 commented Dec 7, 2016 •

edited

iku000888 commented Dec 7, 2016 •

edited

iku000888 commented Dec 16, 2016 •

edited