Add MH things
chainsawriot committed Jun 19, 2024
1 parent 3e62518 commit 21c34d8
Showing 10 changed files with 465 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .Rbuildignore
@@ -12,3 +12,11 @@
^pkgdown$
^cran-comments\.md$
^CRAN-SUBMISSION$
^CITATION\.cff$
^install\.R$
^postBuild$
^apt\.txt$
^runtime\.txt$
^_quarto\.yml$
^\.quarto$
^methodshub
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,3 +2,5 @@
README.html
inst/doc
docs

/.quarto/
155 changes: 155 additions & 0 deletions CITATION.cff
@@ -0,0 +1,155 @@
# --------------------------------------------
# CITATION file created with {cffr} R package
# See also: https://docs.ropensci.org/cffr/
# --------------------------------------------

cff-version: 1.2.0
message: 'To cite package "adaR" in publications use:'
type: software
license: MIT
title: 'adaR: A Fast ''WHATWG'' Compliant URL Parser'
version: 0.3.2
abstract: A wrapper for 'ada-url', a 'WHATWG' compliant and fast URL parser written
  in modern 'C++'. Also contains auxiliary functions such as a public suffix extractor.
authors:
- family-names: Schoch
  given-names: David
  email: [email protected]
  orcid: https://orcid.org/0000-0003-2952-4812
- family-names: Chan
  given-names: Chung-hong
  email: [email protected]
  orcid: https://orcid.org/0000-0002-6232-7530
repository: https://CRAN.R-project.org/package=adaR
repository-code: https://github.com/gesistsa/adaR
url: https://gesistsa.github.io/adaR/
contact:
- family-names: Schoch
  given-names: David
  email: [email protected]
  orcid: https://orcid.org/0000-0003-2952-4812
keywords:
- r
- rstats
- rstats-package
- url-parser
references:
- type: software
  title: Rcpp
  abstract: 'Rcpp: Seamless R and C++ Integration'
  notes: LinkingTo
  url: https://www.rcpp.org
  repository: https://CRAN.R-project.org/package=Rcpp
  authors:
  - family-names: Eddelbuettel
    given-names: Dirk
  - family-names: Francois
    given-names: Romain
  - family-names: Allaire
    given-names: JJ
  - family-names: Ushey
    given-names: Kevin
  - family-names: Kou
    given-names: Qiang
  - family-names: Russell
    given-names: Nathan
  - family-names: Ucar
    given-names: Inaki
  - family-names: Bates
    given-names: Douglas
  - family-names: Chambers
    given-names: John
  year: '2024'
- type: software
  title: triebeard
  abstract: 'triebeard: ''Radix'' Trees in ''Rcpp'''
  notes: Imports
  url: https://github.com/Ironholds/triebeard/
  repository: https://CRAN.R-project.org/package=triebeard
  authors:
  - family-names: Keyes
    given-names: Os
  - family-names: Schmidt
    given-names: Drew
  - family-names: Takano
    given-names: Yuuki
  year: '2024'
- type: software
  title: knitr
  abstract: 'knitr: A General-Purpose Package for Dynamic Report Generation in R'
  notes: Suggests
  url: https://yihui.org/knitr/
  repository: https://CRAN.R-project.org/package=knitr
  authors:
  - family-names: Xie
    given-names: Yihui
    email: [email protected]
    orcid: https://orcid.org/0000-0003-0645-5666
  year: '2024'
- type: software
  title: rmarkdown
  abstract: 'rmarkdown: Dynamic Documents for R'
  notes: Suggests
  url: https://pkgs.rstudio.com/rmarkdown/
  repository: https://CRAN.R-project.org/package=rmarkdown
  authors:
  - family-names: Allaire
    given-names: JJ
    email: [email protected]
  - family-names: Xie
    given-names: Yihui
    email: [email protected]
    orcid: https://orcid.org/0000-0003-0645-5666
  - family-names: Dervieux
    given-names: Christophe
    email: [email protected]
    orcid: https://orcid.org/0000-0003-4474-2498
  - family-names: McPherson
    given-names: Jonathan
    email: [email protected]
  - family-names: Luraschi
    given-names: Javier
  - family-names: Ushey
    given-names: Kevin
    email: [email protected]
  - family-names: Atkins
    given-names: Aron
    email: [email protected]
  - family-names: Wickham
    given-names: Hadley
    email: [email protected]
  - family-names: Cheng
    given-names: Joe
    email: [email protected]
  - family-names: Chang
    given-names: Winston
    email: [email protected]
  - family-names: Iannone
    given-names: Richard
    email: [email protected]
    orcid: https://orcid.org/0000-0003-3925-190X
  year: '2024'
- type: software
  title: testthat
  abstract: 'testthat: Unit Testing for R'
  notes: Suggests
  url: https://testthat.r-lib.org
  repository: https://CRAN.R-project.org/package=testthat
  authors:
  - family-names: Wickham
    given-names: Hadley
    email: [email protected]
  year: '2024'
  version: '>= 3.0.0'
- type: software
  title: 'R: A Language and Environment for Statistical Computing'
  notes: Depends
  url: https://www.R-project.org/
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
  version: '>= 4.2'

5 changes: 5 additions & 0 deletions _quarto.yml
@@ -0,0 +1,5 @@
project:
  title: adaR
  type: default
  render:
    - methodshub.qmd
1 change: 1 addition & 0 deletions apt.txt
@@ -0,0 +1 @@
zip
1 change: 1 addition & 0 deletions install.R
@@ -0,0 +1 @@
install.packages("adaR")
126 changes: 126 additions & 0 deletions methodshub.md
@@ -0,0 +1,126 @@
# adaR - A Fast ‘WHATWG’ Compliant URL Parser


## Description

<!-- - Provide a brief and clear description of the method, its purpose, and what it aims to achieve. Add a link to a related paper from social science domain and show how your method can be applied to solve that research question. -->

A wrapper for ‘ada-url’, a ‘WHATWG’ compliant and fast URL parser
written in modern ‘C++’. Also contains auxiliary functions such as a
public suffix extractor.

## Keywords

<!-- EDITME -->

- URL Parsing
- Web Tracking Data
- Web Scraping

## Science Usecase(s)

<!-- - Include usecases from social sciences that would make this method applicable in a certain scenario. -->
<!-- The use cases or research questions mentioned should arise from the latest social science literature cited in the description. -->

URL parsing is an important step in the analysis of web tracking data,
e.g. data collected with [GESIS Web
Tracking](https://www.gesis.org/en/services/planning-studies-and-collecting-data/tools-for-the-collection-of-digital-behavioral-data/gesis-web-tracking).
The technique, although not this package itself, has been used in
various social science publications, e.g. [de León et
al. (2023)](https://doi.org/10.5117/CCR2023.2.4.DELE).

The package has been used in various web scraping projects for
communication research, e.g. [paperboy](https://github.com/JBGruber/paperboy).

## Repository structure

This repository follows [the standard structure of an R
package](https://cran.r-project.org/doc/FAQ/R-exts.html#Package-structure).

## Environment Setup

With R installed:

``` r
install.packages("adaR")
```

<!-- ## Hardware Requirements (Optional) -->
<!-- - The hardware requirements may be needed in specific cases when a method is known to require more memory/compute power. -->
<!-- - The method need to be executed on a specific architecture (GPUs, Hadoop cluster etc.) -->

## Input Data

<!-- - The input data has to be a Digital Behavioral Data (DBD) Dataset -->
<!-- - You can provide link to a public DBD dataset. GESIS DBD datasets (https://www.gesis.org/en/institute/digital-behavioral-data) -->

The input data must be a character vector of URLs.
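
URLs often arrive as a column of a larger data frame rather than as a clean vector. The following is a minimal sketch, using base R only, of how such a column might be turned into a suitable input vector. The `visits` data frame and its `url` column are hypothetical and only serve as an illustration.

``` r
# Hypothetical web tracking export: one row per page visit
visits <- data.frame(
  url = c(" https://www.gesis.org/en/home ",
          NA,
          "https://www.gesis.org/en/home",
          "https://cran.r-project.org/package=adaR"),
  stringsAsFactors = FALSE
)

# Strip surrounding whitespace, then drop missing, empty and duplicated entries
urls <- trimws(visits$url)
urls <- unique(urls[!is.na(urls) & nzchar(urls)])

urls
```

Whether duplicates should be dropped depends on the analysis: keep them if every visit should count.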

## Sample Input and Output Data

<!-- - Show how the input data looks like through few sample instances -->
<!-- - Providing a sample output on the sample input to help cross check -->

The input data looks like this:

``` r
urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1")

urls
```

    [1] "https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1"

The output data is a data frame of parsed URLs, with one row per URL and
one column per URL component.

## How to Use

<!-- - Providing HowTos on the method for different types of usages -->
<!-- - Describe how the method should be used, including installation, configuration, and any specific instructions for users. -->

Please refer to the [“Introduction to
adaR”](https://gesistsa.github.io/adaR/articles/adaR.html) for a
comprehensive introduction to the package.

The main function of this package is `ada_url_parse()`, which decomposes
each URL into its components.

``` r
library(adaR)

urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1",
"https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html",
"https://www.sueddeutsche.de/thema/Fu%C3%9Fball-EM")

ada_url_parse(urls)
```

                                                                                                  href
    1 https://www.google.de/search?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1
    2                  https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html
    3                                                 https://www.sueddeutsche.de/thema/Fußball-EM
      protocol username password                host            hostname port
    1   https:                         www.google.de       www.google.de
    2   https:                       www.nytimes.com     www.nytimes.com
    3   https:                   www.sueddeutsche.de www.sueddeutsche.de
                                                  pathname
    1                                              /search
    2 /2024/06/19/world/africa/sudan-darfur-takeaways.html
    3                                    /thema/Fußball-EM
                                                                search hash
    1 ?q=GESIS&client=ubuntu&hs=ixb&sca_esv=dccc38f8e2930152&sca_upv=1
    2
    3
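
Because the result is an ordinary data frame, individual components can be pulled out with standard base R tools. The following is a minimal sketch of that idea. It relies only on `ada_url_parse()` and the column names visible in the output above; the shortened URLs and the per-host tabulation are illustrative choices, not part of the package documentation.

``` r
library(adaR)

urls <- c("https://www.google.de/search?q=GESIS&client=ubuntu",
          "https://www.nytimes.com/2024/06/19/world/africa/sudan-darfur-takeaways.html")

parsed <- ada_url_parse(urls)

# Each URL component is a column of the returned data frame
parsed$hostname

# Keep only the components needed for an analysis
parsed[, c("hostname", "pathname", "search")]

# Count how often each host occurs, e.g. to summarise browsing behaviour
table(parsed$hostname)
```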

## Contact Details

Maintainer: David Schoch <[email protected]>

Issue Tracker: <https://github.com/gesistsa/adaR/issues>

<!-- ## Publication -->
<!-- - Include information on publications or articles related to the method, if applicable. -->
<!-- ## Acknowledgements -->
<!-- - Acknowledgements if any -->
<!-- ## Disclaimer -->
<!-- - Add any disclaimers, legal notices, or usage restrictions for the method, if necessary. -->