adaR - A Fast ‘WHATWG’ Compliant URL Parser

Description

A wrapper for ‘ada-url’, a ‘WHATWG’ compliant and fast URL parser written in modern ‘C++’. Also contains auxiliary functions such as a public suffix extractor.

Keywords

  • URL Parsing
  • Webtracking Data
  • Webscraping

Science Usecase(s)

URL parsing is an important step in the analysis of webtracking data, e.g. GESIS Web Tracking. The technique, although not this package itself, has been used in various social science publications, e.g. de León et al. (2023).

The package has been used in various webscraping projects for communication research, e.g. paperboy.

Repository structure

This repository follows the standard structure of an R package.

Environment Setup

With R installed:

install.packages("adaR")

Input Data

The input data has to be a character vector of URLs.
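If the URLs come from a file or a data frame column, they may need light cleaning first. The following is a minimal base R sketch; the `raw` object and its contents are made up for illustration, and whether adaR itself tolerates factors or missing values is not stated here, so the cleaning steps are a conservative assumption rather than a requirement of the package.

```r
# Hypothetical raw input, e.g. a column read from a tracking export.
raw <- factor(c(" https://www.google.de/search?q=GESIS ",
                NA,
                "https://www.sueddeutsche.de/thema/Fu%C3%9Fball-EM"))

# Convert the factor to character and strip surrounding whitespace.
urls <- trimws(as.character(raw))

# Drop missing entries so only actual URLs remain.
urls <- urls[!is.na(urls)]

is.character(urls)
length(urls)
```

The resulting `urls` object is a plain character vector with one element per URL.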

Sample Input and Output Data

The input data looks like this:

urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1")

urls
[1] "https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1"

The output data is a data frame of parsed URLs.

How to Use

Please refer to the “Introduction to adaR” for a comprehensive introduction to the package.

The main function of this package is ada_url_parse(), which decomposes a URL into its components.

library(adaR)

urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1",
          "https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html",
          "https://www.sueddeutsche.de/thema/Fu%C3%9Fball-EM")

ada_url_parse(urls)
                                                                                          href
1 https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1
2                  https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html
3                                                 https://www.sueddeutsche.de/thema/Fußball-EM
  protocol username password                host            hostname port
1   https:                         www.google.de       www.google.de     
2   https:                       www.nytimes.com     www.nytimes.com     
3   https:                   www.sueddeutsche.de www.sueddeutsche.de     
                                              pathname
1                                              /search
2 /2024/06/19/world/africa/sudan-darfur-takeaways.html
3                                    /thema/Fußball-EM
                                                            search hash
1 ?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1     
2                                                                      
3                                                                      
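Because the result is an ordinary data frame, its columns can be processed further with base R. As a sketch, the `search` column shown above can be split into individual query parameters. The helper `split_query()` below is hypothetical and not part of adaR; it assumes parameters are separated by `&` and does not percent-decode the values.

```r
library(adaR)

# Hypothetical helper (not part of adaR): split a "?k1=v1&k2=v2" string
# into a named character vector using base R only.
split_query <- function(search) {
  # Empty or missing query strings yield an empty vector.
  if (is.na(search) || !nzchar(search)) {
    return(character(0))
  }
  # Remove the leading "?" and split on "&".
  pairs <- strsplit(sub("^\\?", "", search), "&", fixed = TRUE)[[1]]
  # Everything before the first "=" is the key, the rest is the value.
  keys <- sub("=.*$", "", pairs)
  values <- ifelse(grepl("=", pairs, fixed = TRUE),
                   sub("^[^=]*=", "", pairs),
                   "")
  stats::setNames(values, keys)
}

parsed <- ada_url_parse(
  "https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1"
)

# One named element per query parameter, e.g. q = "GESIS".
split_query(parsed$search)
```

For a whole data frame of parsed URLs, the helper could be applied row by row with `lapply(parsed$search, split_query)`.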

Contact Details

Maintainer: David Schoch [email protected]

Issue Tracker: https://github.com/gesistsa/adaR/issues