Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added to_unicode parameter (fix #66) #67

Closed
wants to merge 2 commits into from
Closed

Conversation

schochastics
Copy link
Member

@schochastics schochastics commented Jan 30, 2024

added a parameter to turn of the unicode in charsub()

adaR::ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html",to_unicode = TRUE) 
#> [1] "http://xn--53-6kcainf4buoffq.ҖҖpҕҘ1җai/dҗoof/jҖuҔnior.html"
adaR::ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html", to_unicode = FALSE)
#> [1] "http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html"

Created on 2024-01-30 with reprex v2.1.0

@chainsawriot what do you think about this solution? Does that make sense to you?

@chainsawriot
Copy link
Collaborator

Thank you for the implementation. But there is one problem. As I said previously, this actually only affects the href in #66 . Therefore, we shouldn't change the behavior of extracting other fields.

Just as an example:

url <- "http://xn--53-6kcainf4buoffq.xn--p1ai/\u6e2f\u805e/junior-programmer.html"
adaR::ada_url_parse(url, to_unicode = TRUE)
#>                                                                              href
#> 1 http://xn--53-6kcainf4buoffq.ʾp1aʽi/\xe6\xb8ʾ\xaf\xafE8\x81ʽʽ\x9e/juniʼoʾr.html
#>   protocol username password             host         hostname port
#> 1    http:                   поверкадома53.рф поверкадома53.рф     
#>                       pathname search hash
#> 1 /港聞/junior-programmer.html
adaR::ada_url_parse(url, to_unicode = FALSE)
#>                                                                href protocol
#> 1 http://xn--53-6kcainf4buoffq.xn--p1ai/港聞/junior-programmer.html    http:
#>   username password                           host
#> 1                   xn--53-6kcainf4buoffq.xn--p1ai
#>                         hostname port                     pathname search hash
#> 1 xn--53-6kcainf4buoffq.xn--p1ai      /港聞/junior-programmer.html

Created on 2024-01-31 with reprex v2.0.2

For example, in the second case, hostname doesn't change from puny to the actual hostname; but pathname is actually in unicode (breaking the promise of to_unicode = FALSE).

My solution would be not having the parameter to_unicode. Instead, just silently converting href (or not converting, just return the original input url). Also, as the package requires R 4.2, one would expect the package would return UTF-8 by default. Therefore, to_unicode = FALSE is a bit confusing.

@schochastics
Copy link
Member Author

Hmm you are right. I will adapt accordingly, thanks!

chainsawriot added a commit that referenced this pull request Jan 31, 2024
See #68 and #67

This passes the basic tests.
chainsawriot added a commit that referenced this pull request Jan 31, 2024
* added to_unicode parameter

* added test

* Use the silent conversion approach fix #66

See #68 and #67

This passes the basic tests.

* A better test suite for #66

* More tests

* Head hygiene and default parameter

* Bürokratie

---------

Co-authored-by: schochastics <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants