adaR

adaR is a wrapper for ada-url, a fast, WHATWG-compliant URL parser written in modern C++.

It implements several auxiliary functions to work with URLs:

public suffix extraction (top level domain excluding private domains) like psl
fast C++ implementation of utils::URLdecode (~40x speedup)
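
As a rough illustration of what URL decoding does, the sketch below uses base R's utils::URLdecode on a few made-up strings. It deliberately does not call adaR's faster decoder, and the input strings are assumptions chosen for the example.

```r
# Percent-encoded example strings (made up for illustration)
encoded <- c("Hello%20World", "a%2Bb%3Dc", "dir%2Ffile.txt")

# Apply URLdecode element-wise so the sketch does not rely on vectorisation
decoded <- vapply(encoded, utils::URLdecode, character(1), USE.NAMES = FALSE)
print(decoded)
```

Each %XX sequence is replaced by the character with that hexadecimal byte value, so %20 becomes a space and %2F becomes a slash.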

More general information on URL parsing can be found in the introductory vignette via vignette("adaR").

adaR is part of a series of R packages to analyse webtracking data:

webtrackR: preprocess raw webtracking data
domainator: classify domains
adaR: parse urls

Installation

You can install the development version of adaR from GitHub with:

# install.packages("devtools")
devtools::install_github("gesistsa/adaR")

The version on CRAN can be installed with:

install.packages("adaR")

Example

This is a basic example which shows all the returned components of a URL.

library(adaR)
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
#>                                                      href
#> 1 https://user_1:password_1@example.org:8080/api?q=1#frag
#>   protocol username   password             host
#> 1   https:   user_1 password_1 example.org:8080
#>      hostname port pathname search  hash
#> 1 example.org 8080     /api   ?q=1 #frag

  /*
   * https://user:pass@example.com:1234/foo/bar?baz#quux
   *       |     |    |          | ^^^^|       |   |
   *       |     |    |          | |   |       |   `----- hash_start
   *       |     |    |          | |   |       `--------- search_start
   *       |     |    |          | |   `----------------- pathname_start
   *       |     |    |          | `--------------------- port
   *       |     |    |          `----------------------- host_end
   *       |     |    `---------------------------------- host_start
   *       |     `--------------------------------------- username_end
   *       `--------------------------------------------- protocol_end
   */
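
The comment above marks offsets into the URL string. As a sketch of how such offsets carve a URL into components, the base R snippet below slices the same string by hand. The offset values and the helper slice0() are assumptions worked out manually for this one URL (0-based, end-exclusive, with host_end taken as the position of the colon before the port). They are not values returned by adaR.

```r
href <- "https://user:pass@example.com:1234/foo/bar?baz#quux"

# Hypothetical helper: 0-based, end-exclusive slice, mapped onto substr()
slice0 <- function(s, start, end) substr(s, start + 1, end)

# Offsets counted by hand for this URL (assumed values, 0-based)
off <- list(protocol_end = 6, username_end = 12, host_start = 17,
            host_end = 29, pathname_start = 34, search_start = 42,
            hash_start = 46)

parts <- c(
  protocol = slice0(href, 0, off$protocol_end),
  username = slice0(href, off$protocol_end + 2, off$username_end),  # skip "//"
  password = slice0(href, off$username_end + 1, off$host_start),    # skip ":"
  hostname = slice0(href, off$host_start + 1, off$host_end),        # skip "@"
  port     = slice0(href, off$host_end + 1, off$pathname_start),    # skip ":"
  pathname = slice0(href, off$pathname_start, off$search_start),
  search   = slice0(href, off$search_start, off$hash_start),
  hash     = slice0(href, off$hash_start, nchar(href))
)
print(parts)
```

Storing offsets instead of copies of each component means every part can be read straight out of the one serialized string.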

It solves some problems that urltools has with more complex URLs.

urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.
   7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>   scheme                            domain port
#> 1  https 40.7519848,-74.0015045,14.\n   7z <NA>
#>                                                                                 path
#> 1 data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   parameter fragment
#> 1      <NA>     <NA>

ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m
   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>                                                                                                                                                                         href
#> 1 https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   protocol username password           host       hostname
#> 1   https:                   www.google.com www.google.com
#>   port
#> 1     
#>                                                                                                                                               pathname
#> 1 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   search hash
#> 1

A “raw” URL parse using ada is extremely fast (see ada-url.com), but carrying that speed over to R is tricky. The performance is still comparable to urltools::url_parse, with the added advantage of better accuracy in some practical circumstances.

bench::mark(
  ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
  urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 ada          9.43µs   10.5µs    90598.        0B     9.06
#> 2 urltools   102.25µs  108.1µs     9143.        0B    16.3

For further benchmark results, see benchmark.md in data-raw.

There are four more groups of functions available to work with url parsing:

ada_get_*() get a specific component
ada_has_*() check if a specific component is present
ada_set_*() set a specific component of URLs
ada_clear_*() remove a specific component from URLs
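
A short sketch of the four groups in use, reusing the URL from the parsing example. The specific functions called here (ada_get_hostname(), ada_get_pathname(), ada_has_port(), ada_clear_hash(), ada_set_port()) and the form of their arguments are assumptions about how the ada_*_*() pattern is filled in; check the package reference for the exact set and signatures.

```r
library(adaR)

url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"

# ada_get_*(): extract a single component
hostname <- ada_get_hostname(url)
pathname <- ada_get_pathname(url)

# ada_has_*(): check whether a component is present
has_port <- ada_has_port(url)

# ada_clear_*(): remove a component (assumed to return the modified URL)
no_hash <- ada_clear_hash(url)

# ada_set_*(): replace a component (new value assumed to be passed as a string)
new_port <- ada_set_port(url, "9090")

print(list(hostname, pathname, has_port, no_hash, new_port))
```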

Public Suffix extraction

public_suffix() extracts the top-level domain of each URL according to the public suffix list, excluding private domains.

urls <- c(
  "https://subsub.sub.domain.co.uk",
  "https://domain.api.gov.uk",
  "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk"                           
#> [2] "gov.uk"                          
#> [3] "butthisispartoftheps.kawasaki.jp"

If you are wondering about the last URL: the list also contains wildcard suffixes such as *.kawasaki.jp, which need to be matched.
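
To make the wildcard case concrete, here is a minimal base R sketch of longest-suffix matching against a toy rule set. It is not adaR's implementation: the rules vector and match_suffix() are hypothetical, and exception rules in the real list (those prefixed with "!") are ignored.

```r
# Toy rule set; the real public suffix list is far larger
rules <- c("uk", "co.uk", "gov.uk", "jp", "*.kawasaki.jp")

# Hypothetical helper: return the longest suffix of `hostname` matching a rule
match_suffix <- function(hostname, rules) {
  labels <- strsplit(hostname, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Walk candidates from longest to shortest, so the first hit is the longest
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    # Same candidate with its leftmost label replaced by "*"
    wildcard <- paste(c("*", labels[-seq_len(i)]), collapse = ".")
    if (candidate %in% rules || wildcard %in% rules) {
      return(candidate)
    }
  }
  NA_character_
}

hosts <- c(
  "subsub.sub.domain.co.uk",
  "domain.api.gov.uk",
  "thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
suffixes <- vapply(hosts, match_suffix, character(1), rules = rules,
                   USE.NAMES = FALSE)
print(suffixes)
#> [1] "co.uk"                           
#> [2] "gov.uk"                          
#> [3] "butthisispartoftheps.kawasaki.jp"
```

For the last host, the candidate butthisispartoftheps.kawasaki.jp has no exact rule, but replacing its leftmost label with * gives *.kawasaki.jp, which is in the rule set, so the whole candidate counts as the suffix.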

Acknowledgement

The logo is created from this portrait of Ada Lovelace, a very early pioneer in Computer Science.
