make submission ready

schochastics · schochastics · commit 4f32167bf277 · 2023-09-28T13:12:00.000+02:00
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -10,3 +10,4 @@
 ^_pkgdown\.yml$
 ^docs$
 ^pkgdown$
+^cran-comments\.md$
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,13 +1,15 @@
 Package: adaR
-Title: A Fast WHATWG-compliant URL Parser
+Title: A Fast WHATWG Compliant URL Parser
 Version: 0.1.0.9000
 Authors@R: 
     c(person("David", "Schoch", , "david@schochastics.net", role = c("aut", "cre"),
            comment = c(ORCID = "0000-0003-2952-4812")),
       person("Chung-hong", "Chan", role = c("aut"), email = "chainsawtiney@gmail.com",
-	   comment = c(ORCID = "0000-0002-6232-7530"))
+	   comment = c(ORCID = "0000-0002-6232-7530")),
+      person("Yagiz", "Nizipli", role = c("ctb", "cph"), comment = "author of ada-url: <https://github.com/ada-url/ada>"),
+      person("Daniel", "Lemire", role = c("ctb", "cph"), comment = "author of ada-url: <https://github.com/ada-url/ada>")
      )
-Description: A wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++. Also contains auxiliary functions to extract public suffix.
+Description: A wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++. Also contains auxiliary functions such as a public suffix extractor.
 URL: https://schochastics.github.io/adaR/, https://github.com/schochastics/adaR
 BugReports: https://github.com/schochastics/adaR/issues
 License: MIT + file LICENSE
diff --git a/NEWS.md b/NEWS.md
@@ -1,10 +1,14 @@
 # adaR 0.1.0.9000
 
-* split C++ files h/t Chung-hong Chan (@chainsawriot)
+* split C++ file to isolate original ada-url code h/t Chung-hong Chan (@chainsawriot)
 * add support for public suffix extraction #14
 * add support for punycode #18
 * added `url_decode2` as a fast alternative to `utils::URLdecode` 
-* improved vectorization of `ada_get_*` and `ada_has_*` #26 and #30 h/t Chung-hong Chan (@chainsawriot)
+* improved vectorization of `ada_get_*` and `ada_has_*` #26 and #30 h/t
+  Chung-hong Chan (@chainsawriot) 
+* fixed #47 h/t Chung-hong Chan (@chainsawriot)
+* added `ada_get_domain()` #43
+
 
 # adaR 0.1.0
 
diff --git a/README.Rmd b/README.Rmd
@@ -6,17 +6,18 @@ output: github_document
 
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
-  collapse = TRUE,
-  comment = "#>",
-  fig.path = "man/figures/README-",
-  out.width = "100%"
+    collapse = TRUE,
+    comment = "#>",
+    fig.path = "man/figures/README-",
+    out.width = "100%"
 )
 ```
 
 # adaR <img src="man/figures/logo.png" align="right" height="139" alt="" />
 
 <!-- badges: start -->
 [![R-CMD-check](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml)
+[![CRAN status](https://www.r-pkg.org/badges/version/adaR)](https://CRAN.R-project.org/package=adaR)
 <!-- badges: end -->
 
 adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a
@@ -27,6 +28,8 @@ It implements several auxilliary functions to work with urls:
 - public suffix extraction (top level domain excluding private domains) like [psl](https://github.com/hrbrmstr/psl)
 - fast c++ implementation of `utils::URLdecode` (~40x speedup)
 
+More general information on URL parsing can be found in the introductory vignette via `vignette("adaR")`.
+
 `adaR` is part of a series of R packages to analyse webtracking data:
 
 - [webtrackR](https://github.com/schochastics/webtrackR): preprocess raw webtracking data
@@ -42,9 +45,14 @@ You can install the development version of adaR from [GitHub](https://github.com
 devtools::install_github("schochastics/adaR")
 ```
 
+The version on CRAN can be installed with
+```r
+install.packages("adaR")
+```
+
 ## Example
 
-This is a basic example which shows all the returned components of a URL
+This is a basic example which shows all the returned components of a URL.
 
 ```{r example}
 library(adaR)
@@ -75,26 +83,36 @@ ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.751984
    5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
 ```
 
-A "raw" url parse using ada is extremely fast (see [ada-url.com](https://www.ada-url.com/)) but the implemented interface
-is not yet optimized. The performance is still very compatible with `urltools::url_parse` with the noted advantage in accuracy in some
+A "raw" url parse using ada is extremely fast (see [ada-url.com](https://www.ada-url.com/)), but carrying this speed over to R is tricky.
+The performance is still comparable to `urltools::url_parse`, with the noted advantage in accuracy in some
 practical circumstances.
 
 ```{r faster}
 bench::mark(
-  ada = replicate(1000, ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE)),
-  urltools = replicate(1000, urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")),
-  iterations = 1, check = FALSE
+    ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
+    urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
+    iterations = 1, check = FALSE
 )
 ```
 
+For further benchmark results, see `benchmark.md` in `data_raw`.
+
 ## Public Suffix extraction
 
 `public_suffix()` extracts their top level domain from the [public suffix list](https://publicsuffix.org/), **excluding** private domains. 
-This functionality already exists in the R package [psl](https://github.com/hrbrmstr/psl).
 
-psl relies on a C library and is very fast. However, the package is not on CRAN and has the C library as 
-system requirement. If these are no issues for you and you need that speed, please use that package. 
+```{r public_suffix}
+urls <- c(
+    "https://subsub.sub.domain.co.uk",
+    "https://domain.api.gov.uk",
+    "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
+)
+public_suffix(urls)
+```
+
+If you are wondering about the last URL: the list also contains wildcard suffixes such as `*.kawasaki.jp`, which need to be matched.
+
 
 ## Acknowledgement
 
-The logo is created from [this portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), a very early pioneer in Computer Science.
+The logo is created from [this portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), a very early pioneer in Computer Science.
diff --git a/README.md b/README.md
@@ -6,6 +6,8 @@
 <!-- badges: start -->
 
 [![R-CMD-check](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml)
+[![CRAN
+status](https://www.r-pkg.org/badges/version/adaR)](https://CRAN.R-project.org/package=adaR)
 <!-- badges: end -->
 
 adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a
@@ -18,6 +20,9 @@ It implements several auxilliary functions to work with urls:
   like [psl](https://github.com/hrbrmstr/psl)
 - fast c++ implementation of `utils::URLdecode` (~40x speedup)
 
+More general information on URL parsing can be found in the introductory
+vignette via `vignette("adaR")`.
+
 `adaR` is part of a series of R packages to analyse webtracking data:
 
 - [webtrackR](https://github.com/schochastics/webtrackR): preprocess raw
@@ -36,9 +41,16 @@ You can install the development version of adaR from
 devtools::install_github("schochastics/adaR")
 ```
 
+The version on CRAN can be installed with
+
+``` r
+install.packages("adaR")
+```
+
 ## Example
 
-This is a basic example which shows all the returned components of a URL
+This is a basic example which shows all the returned components of a
+URL.
 
 ``` r
 library(adaR)
@@ -89,36 +101,44 @@ ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.751984
 ```
 
 A “raw” url parse using ada is extremely fast (see
-[ada-url.com](https://www.ada-url.com/)) but the implemented interface
-is not yet optimized. The performance is still very compatible with
+[ada-url.com](https://www.ada-url.com/)), but carrying this speed over to R
+is tricky. The performance is still comparable to
 `urltools::url_parse` with the noted advantage in accuracy in some
 practical circumstances.
 
 ``` r
 bench::mark(
-  ada = replicate(1000, ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE)),
-  urltools = replicate(1000, urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")),
-  iterations = 1, check = FALSE
+    ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
+    urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
+    iterations = 1, check = FALSE
 )
-#> Warning: Some expressions had a GC in every iteration; so filtering is
-#> disabled.
 #> # A tibble: 2 × 6
 #>   expression      min   median `itr/sec` mem_alloc `gc/sec`
 #>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
-#> 1 ada           456ms    456ms      2.19    2.67MB     19.7
-#> 2 urltools      316ms    316ms      3.16    2.59MB     22.1
+#> 1 ada           469µs    469µs     2132.    2.49KB        0
+#> 2 urltools      407µs    407µs     2457.    2.49KB        0
 ```
 
+For further benchmark results, see `benchmark.md` in `data_raw`.
+
 ## Public Suffix extraction
 
 `public_suffix()` extracts their top level domain from the [public
 suffix list](https://publicsuffix.org/), **excluding** private domains.
-This functionality already exists in the R package
-[psl](https://github.com/hrbrmstr/psl).
 
-psl relies on a C library and is very fast. However, the package is not
-on CRAN and has the C library as system requirement. If these are no
-issues for you and you need that speed, please use that package.
+``` r
+urls <- c(
+    "https://subsub.sub.domain.co.uk",
+    "https://domain.api.gov.uk",
+    "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
+)
+public_suffix(urls)
+#> [1] "co.uk"                            "gov.uk"                          
+#> [3] "butthisispartoftheps.kawasaki.jp"
+```
+
+If you are wondering about the last URL: the list also contains wildcard
+suffixes such as `*.kawasaki.jp`, which need to be matched.
 
 ## Acknowledgement
 
diff --git a/cran-comments.md b/cran-comments.md
@@ -0,0 +1,11 @@
+## Initial Submission
+
+## Test environments
+* ubuntu 22.04, R 4.3.1
+* win-builder (devel and release)
+
+## R CMD check results
+
+0 errors | 0 warnings | 1 note
+
+* This is a new release.