make submission ready
schochastics committed Sep 28, 2023
1 parent d264fa7 commit 4f32167
Showing 6 changed files with 90 additions and 34 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -10,3 +10,4 @@
^_pkgdown\.yml$
^docs$
^pkgdown$
^cran-comments\.md$
8 changes: 5 additions & 3 deletions DESCRIPTION
@@ -1,13 +1,15 @@
Package: adaR
Title: A Fast WHATWG-compliant URL Parser
Title: A Fast WHATWG Compliant URL Parser
Version: 0.1.0.9000
Authors@R:
c(person("David", "Schoch", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-2952-4812")),
person("Chung-hong", "Chan", role = c("aut"), email = "[email protected]",
comment = c(ORCID = "0000-0002-6232-7530"))
comment = c(ORCID = "0000-0002-6232-7530")),
person("Yagiz", "Nizipli", role = c("ctb", "cph"),comment = "author of ada-url : <https://github.com/ada-url/ada>"),
person("Daniel", "Lemire", role = c("ctb", "cph"),comment = "author of ada-url : <https://github.com/ada-url/ada>")
)
Description: A wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++. Also contains auxiliary functions to extract public suffix.
Description: A wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++. Also contains auxiliary functions such as a public suffix extractor.
URL: https://schochastics.github.io/adaR/, https://github.com/schochastics/adaR
BugReports: https://github.com/schochastics/adaR/issues
License: MIT + file LICENSE
8 changes: 6 additions & 2 deletions NEWS.md
@@ -1,10 +1,14 @@
# adaR 0.1.0.9000

* split C++ files h/t Chung-hong Chan (@chainsawriot)
* split C++ file to isolate original ada-url code h/t Chung-hong Chan (@chainsawriot)
* add support for public suffix extraction #14
* add support for punycode #18
* added `url_decode2` as a fast alternative to `utils::URLdecode`
* improved vectorization of `ada_get_*` and `ada_has_*` #26 and #30 h/t Chung-hong Chan (@chainsawriot)
* improved vectorization of `ada_get_*` and `ada_has_*` #26 and #30 h/t
Chung-hong Chan (@chainsawriot)
* fixed #47 h/t Chung-hong Chan (@chainsawriot)
* added `ada_get_domain()` #43


# adaR 0.1.0

46 changes: 32 additions & 14 deletions README.Rmd
@@ -6,17 +6,18 @@ output: github_document

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# adaR <img src="man/figures/logo.png" align="right" height="139" alt="" />

<!-- badges: start -->
[![R-CMD-check](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/adaR)](https://CRAN.R-project.org/package=adaR)
<!-- badges: end -->

adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a
@@ -27,6 +28,8 @@ It implements several auxiliary functions to work with urls:
- public suffix extraction (top level domain excluding private domains) like [psl](https://github.com/hrbrmstr/psl)
- fast c++ implementation of `utils::URLdecode` (~40x speedup)

More general information on URL parsing can be found in the introductory vignette via `vignette("adaR")`.

`adaR` is part of a series of R packages to analyse webtracking data:

- [webtrackR](https://github.com/schochastics/webtrackR): preprocess raw webtracking data
@@ -42,9 +45,14 @@ You can install the development version of adaR from [GitHub](https://github.com
devtools::install_github("schochastics/adaR")
```

The version on CRAN can be installed with
```r
install.packages("adaR")
```

## Example

This is a basic example which shows all the returned components of a URL
This is a basic example which shows all the returned components of a URL.

```{r example}
library(adaR)
@@ -75,26 +83,36 @@ ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.751984
5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
```

A "raw" url parse using ada is extremely fast (see [ada-url.com](https://www.ada-url.com/)) but the implemented interface
is not yet optimized. The performance is still very compatible with `urltools::url_parse` with the noted advantage in accuracy in some
A "raw" url parse using ada is extremely fast (see [ada-url.com](https://www.ada-url.com/)), but carrying this speed over to R is tricky.
The performance is still comparable to `urltools::url_parse`, with the noted advantage in accuracy in some
practical circumstances.

```{r faster}
bench::mark(
ada = replicate(1000, ada_url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag", decode = FALSE)),
urltools = replicate(1000, urltools::url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag")),
iterations = 1, check = FALSE
ada = ada_url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag", decode = FALSE),
urltools = urltools::url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag"),
iterations = 1, check = FALSE
)
```

For further benchmark results, see `benchmark.md` in `data_raw`.

## Public Suffix extraction

`public_suffix()` extracts the top level domain of each URL according to the [public suffix list](https://publicsuffix.org/), **excluding** private domains.
This functionality already exists in the R package [psl](https://github.com/hrbrmstr/psl).

psl relies on a C library and is very fast. However, the package is not on CRAN and has the C library as
system requirement. If these are no issues for you and you need that speed, please use that package.
```{r public_suffix}
urls <- c(
"https://subsub.sub.domain.co.uk",
"https://domain.api.gov.uk",
"https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
```

If you are wondering about the last URL: the list also contains wildcard suffixes such as `*.kawasaki.jp`, which need to be matched.


## Acknowledgement

The logo is created from [this portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), a very early pioneer in Computer Science.
The logo is created from [this portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), a very early pioneer in Computer Science.
50 changes: 35 additions & 15 deletions README.md
@@ -6,6 +6,8 @@
<!-- badges: start -->

[![R-CMD-check](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/schochastics/adaR/actions/workflows/R-CMD-check.yaml)
[![CRAN
status](https://www.r-pkg.org/badges/version/adaR)](https://CRAN.R-project.org/package=adaR)
<!-- badges: end -->

adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a
@@ -18,6 +20,9 @@ It implements several auxiliary functions to work with urls:
like [psl](https://github.com/hrbrmstr/psl)
- fast c++ implementation of `utils::URLdecode` (~40x speedup)

More general information on URL parsing can be found in the introductory
vignette via `vignette("adaR")`.

`adaR` is part of a series of R packages to analyse webtracking data:

- [webtrackR](https://github.com/schochastics/webtrackR): preprocess raw
@@ -36,9 +41,16 @@ You can install the development version of adaR from
devtools::install_github("schochastics/adaR")
```

The version on CRAN can be installed with

``` r
install.packages("adaR")
```

## Example

This is a basic example which shows all the returned components of a URL
This is a basic example which shows all the returned components of a
URL.

``` r
library(adaR)
@@ -89,36 +101,44 @@ ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.751984
```

A “raw” url parse using ada is extremely fast (see
[ada-url.com](https://www.ada-url.com/)) but the implemented interface
is not yet optimized. The performance is still very compatible with
[ada-url.com](https://www.ada-url.com/)), but carrying this speed over to R
is tricky. The performance is still comparable to
`urltools::url_parse` with the noted advantage in accuracy in some
practical circumstances.

``` r
bench::mark(
ada = replicate(1000, ada_url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag", decode = FALSE)),
urltools = replicate(1000, urltools::url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag")),
iterations = 1, check = FALSE
ada = ada_url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag", decode = FALSE),
urltools = urltools::url_parse("https://user_1:[email protected]:8080/dir/../api?q=1#frag"),
iterations = 1, check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 ada 456ms 456ms 2.19 2.67MB 19.7
#> 2 urltools 316ms 316ms 3.16 2.59MB 22.1
#> 1 ada 469µs 469µs 2132. 2.49KB 0
#> 2 urltools 407µs 407µs 2457. 2.49KB 0
```

For further benchmark results, see `benchmark.md` in `data_raw`.

## Public Suffix extraction

`public_suffix()` extracts the top level domain of each URL according to the [public
suffix list](https://publicsuffix.org/), **excluding** private domains.
This functionality already exists in the R package
[psl](https://github.com/hrbrmstr/psl).

psl relies on a C library and is very fast. However, the package is not
on CRAN and has the C library as system requirement. If these are no
issues for you and you need that speed, please use that package.
``` r
urls <- c(
"https://subsub.sub.domain.co.uk",
"https://domain.api.gov.uk",
"https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk" "gov.uk"
#> [3] "butthisispartoftheps.kawasaki.jp"
```

If you are wondering about the last URL: the list also contains wildcard
suffixes such as `*.kawasaki.jp`, which need to be matched.
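
How a wildcard rule changes the result can be illustrated with a small base R sketch. This is not adaR's implementation: the `rules` vector is a hand-picked toy subset of the public suffix list, and `match_suffix()` is a hypothetical helper that takes bare host names, returns the longest matching suffix, and ignores exception rules (prefixed with `!`) and private domains.

```r
# Toy subset of public suffix rules (assumption); the real list is far larger.
rules <- c("uk", "co.uk", "gov.uk", "jp", "kawasaki.jp", "*.kawasaki.jp")

# Hypothetical helper: longest matching suffix for a single host name.
match_suffix <- function(host, rules) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Try candidates from the longest (whole host) to the shortest (last label).
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    # A wildcard rule matches when "*" stands in for the leftmost label.
    wildcard <- paste(c("*", labels[-seq_len(i)]), collapse = ".")
    if (candidate %in% rules || wildcard %in% rules) {
      return(candidate)
    }
  }
  NA_character_
}

hosts <- c(
  "subsub.sub.domain.co.uk",
  "domain.api.gov.uk",
  "thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
suffixes <- vapply(hosts, match_suffix, character(1), rules = rules, USE.NAMES = FALSE)
print(suffixes)
```

For the last host, no literal rule covers `butthisispartoftheps.kawasaki.jp`, but replacing its leftmost label with `*` gives `*.kawasaki.jp`, so the whole three-label name counts as the suffix in this sketch.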

## Acknowledgement

11 changes: 11 additions & 0 deletions cran-comments.md
@@ -0,0 +1,11 @@
## Initial Submission

# Test environments
* ubuntu 22.04, R 4.3.1
* win-builder (devel and release)

## R CMD check results

0 errors | 0 warnings | 1 note

* This is a new release.
