A wrapper for ‘ada-url’, a fast, ‘WHATWG’-compliant URL parser written in modern ‘C++’. It also contains auxiliary functions such as a public suffix extractor.
- URL Parsing
- Webtracking Data
- Webscraping
URL parsing is an important step in the analysis of webtracking data, e.g. GESIS Web Tracking. The technique has been used in various social science publications, e.g. de León et al. (2023), although those studies did not use this package.
The package was used in various webscraping projects for communication research, e.g. paperboy.
This repository follows the standard structure of an R package.
With R installed:
install.packages("adaR")
The input data has to be a character vector of URLs, for example:
urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1")
urls
[1] "https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1"
The output data is a data frame of parsed URLs, with one row per URL and one column per URL component.
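In webtracking data, URLs usually arrive as one column of a larger table rather than as a ready-made vector. The sketch below shows one possible way to get from such a table to the input vector and back. The `visits` data frame and its column names are invented for illustration, and the sketch assumes that `ada_url_parse()`, the package's main function, returns one row per input URL in input order.

```r
# Hypothetical webtracking table; the column names are assumptions for this sketch.
visits <- data.frame(
  user_id = c(1, 1, 2),
  url = c(
    "https://www.google.de/search?q=GESIS",
    "https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html",
    "https://www.google.de/search?q=GESIS"
  ),
  stringsAsFactors = FALSE
)

# Parse each distinct URL only once; the input is a plain character vector.
urls <- unique(visits$url)
stopifnot(is.character(urls))

# Position of each visit's URL within the deduplicated vector.
idx <- match(visits$url, urls)

# Only run the parsing step when adaR is installed.
if (requireNamespace("adaR", quietly = TRUE)) {
  parsed <- adaR::ada_url_parse(urls)
  # Assumes one output row per input URL, in input order.
  visits$hostname <- parsed$hostname[idx]
}
```

Deduplicating first keeps the parsing step small when the same pages are visited many times, and `match()` maps the parsed components back to the individual visits without relying on the parsed `href` being identical to the raw URL.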
Please refer to the “Introduction to adaR” for a comprehensive introduction to the package.
The main function of this package is ada_url_parse(), which decomposes a URL into its components.
library(adaR)
urls <- c(
  "https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1",
  "https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html",
  "https://www.sueddeutsche.de/thema/Fu%C3%9Fball-EM"
)
ada_url_parse(urls)
href
1 https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1
2 https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html
3 https://www.sueddeutsche.de/thema/Fußball-EM
protocol username password host hostname port
1 https: www.google.de www.google.de
2 https: www.nytimes.com www.nytimes.com
3 https: www.sueddeutsche.de www.sueddeutsche.de
pathname
1 /search
2 /2024/06/19/world/africa/sudan-darfur-takeaways.html
3 /thema/Fußball-EM
search hash
1 ?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1
2
3
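The `search` column holds the raw query string, which is often the part of interest in webtracking data (e.g. search terms). A minimal base R sketch for splitting it into key-value pairs follows. `parse_query()` is a hypothetical helper written for this example, not an adaR function, and it does not percent-decode the values.

```r
# Hypothetical helper (not part of adaR): split a query string such as
# "?q=GESIS&client=ubuntu" into a named character vector.
parse_query <- function(search) {
  query <- sub("^\\?", "", search)  # drop the leading "?"
  # Return an empty vector when there is no query string.
  if (is.na(query) || !nzchar(query)) {
    return(character(0))
  }
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", pairs)
  # Parameters without "=" get an empty value.
  values <- ifelse(grepl("=", pairs, fixed = TRUE), sub("^[^=]*=", "", pairs), "")
  stats::setNames(values, keys)
}

# The search component of the first URL, as shown in the output above.
search <- "?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1"
params <- parse_query(search)
params[["q"]]
```

Applied to a whole result, something like `lapply(parsed$search, parse_query)` would give one named vector per URL, assuming `parsed` is the data frame returned by `ada_url_parse()`.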
Maintainer: David Schoch
Issue Tracker: https://github.com/gesistsa/adaR/issues