Consider using quanteda::index() #38

koheiw · 2023-11-18T00:59:38Z

I suggest using index(), which can find the positions of keywords, including multi-word phrases.

library(quanteda.proximity)
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <-
  c("Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.",
    "EU policymakers proposed the new agency in 2021 to stop financial firms from aiding criminals and terrorists. Brussels has so far relied on national regulators with no EU authority to stop money laundering and terrorist financing running into billions of euros.")

toks <- tokens(txt) 
len <- ntoken(toks)
idx <- index(toks, pattern = phrase("Tayyip Erdogan"))
pmin(abs(seq_len(len[idx$docname]) - idx$from), abs(seq_len(len[idx$docname]) - idx$to))
#>  [1]  2  1  0  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
#> [26] 22 23 24 25 26 27 28 29 30 31 32 33 34
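The distance calculation above handles a single match in a single document. As a rough sketch of how it might generalise to several matches, the base-R helper below takes the `from` and `to` positions (as returned by `index()`) and returns, for each token, the distance to the nearest match. The name `token_distance()` is hypothetical and is not part of quanteda or quanteda.proximity. It also assumes that tokens inside a match should get distance 0, which differs slightly from the `pmin()` expression above for phrases of three or more tokens.

```r
# Hypothetical helper, base R only; not part of quanteda or quanteda.proximity.
# n:    number of tokens in the document
# from: start positions of the matches (e.g. idx$from for that document)
# to:   end positions of the matches (e.g. idx$to for that document)
token_distance <- function(n, from, to) {
  stopifnot(length(from) == length(to))
  pos <- seq_len(n)
  # Assumption: a document without any match gets NA for every token.
  if (length(from) == 0L) {
    return(rep(NA_real_, n))
  }
  # Distance to each match: 0 inside the span, otherwise the gap to its nearer end.
  d <- lapply(seq_along(from), function(i) {
    pmax(from[i] - pos, pos - to[i], 0)
  })
  # Keep the distance to the nearest match for every token position.
  Reduce(pmin, d)
}

# Same setting as the example above: 38 tokens, phrase at positions 3 to 4.
token_distance(38, from = 3, to = 4)

# Two matches in a short document: positions 3 to 4 and position 7.
token_distance(8, from = c(3, 7), to = c(4, 7))
#> [1] 2 1 0 0 1 1 0 1
```

For a two-token phrase such as "Tayyip Erdogan", this gives the same values as the `pmin()` expression in the example.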

More generally, pattern2fixed() can be used to parse patterns in the same way as quanteda does.

quanteda.proximity/R/get_dist.R

Lines 28 to 39 in dbd414c

resolve_keywords <- function(keywords, features, valuetype) {
    if (valuetype == "fixed") {
        return(keywords)
    }
    if (valuetype == "glob") {
        regex <- paste(utils::glob2rx(keywords), collapse = "|")
    }
    if (valuetype == "regex") {
        regex <- paste(keywords, collapse = "|")
    }
    return(grep(regex, features, value = TRUE))
}


chainsawriot · 2023-11-20T10:58:44Z

Thank you very much for the suggestions @koheiw

investigate quanteda::index()
use quanteda::pattern2*()

* Use quanteda::index() ref #38
* Indexing only once
* More optz
* Code clean up [no ci]
* Add `case_insensitive` and make `phrase` work
* Add tests; also add a custom field. That custom field might be useful for `dfm()`
* Explicit return value style guide...

chainsawriot added a commit that referenced this issue Nov 21, 2023

Use quanteda::index() ref #38

bf530b4
