This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

Consider using quanteda::index() #38

Open
koheiw opened this issue Nov 18, 2023 · 1 comment

Comments


koheiw commented Nov 18, 2023

I suggest using index(), which can find the positions of keywords, including phrases.

library(quanteda.proximity)
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <-
  c("Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.",
    "EU policymakers proposed the new agency in 2021 to stop financial firms from aiding criminals and terrorists. Brussels has so far relied on national regulators with no EU authority to stop money laundering and terrorist financing running into billions of euros.")

toks <- tokens(txt) 
len <- ntoken(toks)
idx <- index(toks, pattern = phrase("Tayyip Erdogan"))
pmin(abs(seq_len(len[idx$docname]) - idx$from), abs(seq_len(len[idx$docname]) - idx$to))
#>  [1]  2  1  0  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
#> [26] 22 23 24 25 26 27 28 29 30 31 32 33 34
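The one-liner above assumes a single match in a single document. As a rough sketch of how it might generalise to several matches, the base R helper below takes the document length and the `from`/`to` positions (such as the columns returned by index()) and returns each token's distance to the nearest match. `token_distance()` is a hypothetical name used for illustration only; it is not part of quanteda or quanteda.proximity. Unlike the `pmin()` version, it assigns 0 to every token inside a match, including the middle tokens of phrases longer than two tokens.

```r
# Hypothetical helper, not part of quanteda or quanteda.proximity.
# n:        number of tokens in one document
# from, to: integer vectors with the start and end position of each match
#           in that document, e.g. the `from` and `to` columns of index()
token_distance <- function(n, from, to) {
  pos <- seq_len(n)
  if (length(from) == 0L) {
    return(rep(NA_integer_, n))  # no match in this document
  }
  # One column per match: 0 inside the match, otherwise the distance
  # to the nearer edge of that match.
  d <- vapply(seq_along(from), function(i) {
    pmax(from[i] - pos, pos - to[i], 0L)
  }, numeric(n))
  # For each token position, keep the distance to the closest match.
  apply(matrix(d, nrow = n), 1, min)
}

# A match at positions 3-4 in a six-token document
token_distance(6, from = 3L, to = 4L)

# Assumed usage with the objects above (requires quanteda):
# hits <- idx[idx$docname == "text1", ]
# token_distance(len[["text1"]], hits$from, hits$to)
```

With a single match at positions 3 to 4, this gives the same values as the start of the output above (2 1 0 0 1 2).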

More generally, pattern2fixed() can be used to parse patterns in the same way as quanteda does.

resolve_keywords <- function(keywords, features, valuetype) {
  if (valuetype == "fixed") {
    return(keywords)
  }
  if (valuetype == "glob") {
    regex <- paste(utils::glob2rx(keywords), collapse = "|")
  }
  if (valuetype == "regex") {
    regex <- paste(keywords, collapse = "|")
  }
  return(grep(regex, features, value = TRUE))
}
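As a rough sketch of what that replacement might look like, the snippet below expands glob patterns against the types present in a tokens object with pattern2fixed(). The example texts and the variable names `toks_demo` and `fixed` are made up for illustration, and the argument names are assumed to match the installed quanteda version.

```r
library(quanteda)

# Illustrative texts only, shortened from the example above
toks_demo <- tokens(c("Hamas was not a terrorist organisation.",
                      "Stop money laundering and terrorist financing by terrorists."))

# Expand the patterns against the types actually present in the tokens,
# following quanteda's own matching rules. phrase() makes
# "money laundering" a two-token sequence rather than a single type.
fixed <- pattern2fixed(phrase(c("terror*", "money laundering")),
                       types = types(toks_demo),
                       valuetype = "glob",
                       case_insensitive = TRUE)

# `fixed` is a list with one character vector per matched type or sequence
str(fixed)
```

This would cover "fixed", "glob" and "regex" in one call, plus case folding and phrases, which resolve_keywords() does not handle.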

Contributor

chainsawriot commented Nov 20, 2023

Thank you very much for the suggestions @koheiw

  • investigate quanteda::index()
  • use quanteda::pattern2*()

chainsawriot added a commit that referenced this issue Nov 21, 2023
chainsawriot added a commit that referenced this issue Nov 21, 2023
* Use quanteda::index() ref #38

* Indexing only once

* More optz

* Code clean up [no ci]

* Add `case_insensitive` and make `phrase` work

* Add tests; also add a custom field

That custom field might be useful for `dfm()`

* Explicit return value

style guide...