diff --git a/README.Rmd b/README.Rmd index c70fe2e..64411ca 100644 --- a/README.Rmd +++ b/README.Rmd @@ -13,7 +13,7 @@ knitr::opts_chunk$set( ) ``` -# adaR +# adaR [![R-CMD-check](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml) @@ -22,6 +22,11 @@ knitr::opts_chunk$set( adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a [WHATWG](https://url.spec.whatwg.org/#url-parsing)-compliant and fast URL parser written in modern C++ . +It implements several auxiliary functions to work with URLs: + +- public suffix extraction (top-level domain excluding private domains) like [psl](https://github.com/hrbrmstr/psl) +- fast C++ implementation of `utils::URLdecode` (~40x speedup) + ## Installation You can install the development version of adaR from [GitHub](https://github.com/) with: @@ -33,13 +38,28 @@ devtools::install_github("schochastics/adaR") ## Example -This is a basic example which shows all the returned components +This is a basic example which shows all the returned components of a URL ```{r example} library(adaR) ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag") ``` +```c++ + /* + * https://user:pass@example.com:1234/foo/bar?baz#quux + * | | | | ^^^^| | | + * | | | | | | | `----- hash_start + * | | | | | | `--------- search_start + * | | | | | `----------------- pathname_start + * | | | | `--------------------- port + * | | | `----------------------- host_end + * | | `---------------------------------- host_start + * | `--------------------------------------- username_end + * `--------------------------------------------- protocol_end + */ +``` + It solves some problems of urltools with more complex urls. ```{r better} urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.
@@ -48,13 +68,34 @@ urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40. ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m 5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519") ``` - -```{r faster, echo=FALSE,eval=FALSE} + +A "raw" URL parse using ada is extremely fast (see [ada-url.com](https://www.ada-url.com/)) but the implemented interface +is not yet optimized. Its performance is nevertheless comparable to that of `urltools::url_parse`, with the added advantage of greater accuracy in some +practical circumstances. + +```{r faster} bench::mark( ada = replicate(1000, ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE)), urltools = replicate(1000, urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")), iterations = 1, check = FALSE ) -``` +``` + +## Public Suffix extraction + +`public_suffix()` takes URLs and returns their top-level domain from the [public suffix list](https://publicsuffix.org/), **excluding** private domains. +This functionality already exists in the R packages [psl](https://github.com/hrbrmstr/psl) and [urltools](https://cran.r-project.org/package=urltools). + +psl relies on a C library and is lightning fast. However, the package is not on CRAN and has the C library as a +system requirement. If neither of these is an issue for you and you need that speed, please use that package. + +The performance of urltools for this task is quite comparable to psl, but it relies on a different set of +top-level domains (one that, to the best of our knowledge, includes private domains). + +Overall, both packages offer higher performance for this task. This comes as no surprise, since +our extractor is written in base R. Public suffix extraction is not the main objective of this package, yet +we wanted to include a function for this task without introducing new dependencies.
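+ +For illustration, usage could look like the following sketch (the URLs here are hypothetical examples, and the exact return values depend on the version of the public suffix list bundled with the package): + +```{r suffix} +public_suffix(c( + "https://en.wikipedia.org/wiki/Ada_Lovelace", + "https://www.google.co.uk" +)) +```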
+ +## Acknowledgement + +The logo is created from [this portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), an early pioneer of computer science. \ No newline at end of file diff --git a/README.md b/README.md index 6dab08f..ad33190 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ -# adaR +# adaR @@ -12,6 +12,12 @@ adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a [WHATWG](https://url.spec.whatwg.org/#url-parsing)-compliant and fast URL parser written in modern C++ . +It implements several auxiliary functions to work with URLs: + +- public suffix extraction (top-level domain excluding private + domains) like [psl](https://github.com/hrbrmstr/psl) +- fast C++ implementation of `utils::URLdecode` (\~40x speedup) + ## Installation You can install the development version of adaR from @@ -24,7 +30,7 @@ devtools::install_github("schochastics/adaR") ## Example -This is a basic example which shows all the returned components +This is a basic example which shows all the returned components of a URL ``` r library(adaR) @@ -35,6 +41,21 @@ ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag") #> 1 password_1 example.org:8080 example.org 8080 /api ?q=1 #frag ``` +``` cpp + /* + * https://user:pass@example.com:1234/foo/bar?baz#quux + * | | | | ^^^^| | | + * | | | | | | | `----- hash_start + * | | | | | | `--------- search_start + * | | | | | `----------------- pathname_start + * | | | | `--------------------- port + * | | | `----------------------- host_end + * | | `---------------------------------- host_start + * | `--------------------------------------- username_end + * `--------------------------------------------- protocol_end + */ +``` + It solves some problems of urltools with more complex urls.
``` r @@ -59,6 +80,52 @@ ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.751984 #> 1 ``` - +A "raw" URL parse using ada is extremely fast (see +[ada-url.com](https://www.ada-url.com/)) but the implemented interface +is not yet optimized. Its performance is nevertheless comparable to that +of `urltools::url_parse`, with the added advantage of greater accuracy +in some practical circumstances. + +``` r +bench::mark( + ada = replicate(1000, ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE)), + urltools = replicate(1000, urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")), + iterations = 1, check = FALSE +) +#> Warning: Some expressions had a GC in every iteration; so filtering is +#> disabled. +#> # A tibble: 2 × 6 +#> expression min median `itr/sec` mem_alloc `gc/sec` +#> +#> 1 ada 594ms 594ms 1.68 2.67MB 15.1 +#> 2 urltools 393ms 393ms 2.55 2.59MB 15.3 +``` + +## Public Suffix extraction + +`public_suffix()` takes URLs and returns their top-level domain from the +[public suffix list](https://publicsuffix.org/), **excluding** private +domains. This functionality already exists in the R packages +[psl](https://github.com/hrbrmstr/psl) and +[urltools](https://cran.r-project.org/package=urltools). + +psl relies on a C library and is lightning fast. However, the package is +not on CRAN and has the C library as a system requirement. If neither of +these is an issue for you and you need that speed, please use that package. + +The performance of urltools for this task is quite comparable to psl, +but it relies on a different set of top-level domains (one that, to the +best of our knowledge, includes private domains). + +Overall, both packages offer higher performance for this task. This comes +as no surprise, since our extractor is written in base R.
Public +suffix extraction is not the main objective of this package, yet we +wanted to include a function for this task without introducing new +dependencies. + +## Acknowledgement + +The logo is created from [this +portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) +of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), an early +pioneer of computer science. diff --git a/man/figures/logo.png b/man/figures/logo.png new file mode 100644 index 0000000..a8d65a8 Binary files /dev/null and b/man/figures/logo.png differ
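The `utils::URLdecode` replacement mentioned at the top of the README could be exercised like this. This is a sketch: the exported name `url_decode2()` is assumed here, and the input string is an illustrative example, not taken from the package documentation.

``` r
library(adaR)

# Sketch of the decoder speedup; url_decode2() is assumed to be adaR's
# vectorized C++ replacement for the scalar utils::URLdecode().
x <- rep("Hello%20World%2C%20%C3%A4%C3%B6%C3%BC", 1000)

bench::mark(
  base = vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE),
  adaR = url_decode2(x)
)
```

Note that `bench::mark()` checks by default that both expressions return identical results, so this doubles as a correctness check against the base R decoder.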