Use quanteda::index() ref #38 #44

chainsawriot · 2023-11-21T17:18:46Z

No description provided.

chainsawriot · 2023-11-21T17:32:46Z

This is 20x slower than #26, as recorded in #20 by @schochastics.

Several possibilities:

Do we need to build an index for every document?
Where are the bottlenecks?

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"   398ms  428ms      2.34     167MB     18.7

Created on 2023-11-21 with reprex v2.0.2
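On the question of whether an index is needed for every document: a minimal base-R sketch of the alternative, assuming the matches for the whole corpus come back from a single call as a data frame with `docname` and `from` columns (mocked by hand below rather than produced by `quanteda::index()`). The helper `nearest_match_dist()`, the mocked `idx` and the document lengths are all hypothetical and are not the package's code. The distance here is the plain absolute difference in token positions, which may differ from what `tokens_proximity()` reports.

```r
# Hypothetical helper (not from quanteda.proximity): for a document with
# n_tokens tokens, return each position's distance to the nearest match.
nearest_match_dist <- function(n_tokens, match_pos) {
  if (length(match_pos) == 0L) {
    # No match in this document: distance is undefined
    return(rep(NA_integer_, n_tokens))
  }
  poss <- seq_len(n_tokens)
  # outer() builds an n_tokens x length(match_pos) matrix of differences
  dists <- abs(outer(poss, match_pos, "-"))
  # Row-wise minimum: distance from each position to its closest match
  as.integer(apply(dists, 1, min))
}

# Mocked result of ONE indexing pass over the whole corpus (assumption:
# one row per match, with the document name and the match position).
idx <- data.frame(
  docname = c("doc1", "doc1", "doc2"),
  from = c(2L, 5L, 1L)
)
doc_lengths <- c(doc1 = 6L, doc2 = 3L, doc3 = 2L)

# Split the match positions by document once; documents without a match
# get an empty vector because the factor levels list every document.
pos_by_doc <- split(idx$from, factor(idx$docname, levels = names(doc_lengths)))

# Per-document work is now only the cheap distance computation.
res <- Map(nearest_match_dist, doc_lengths, pos_by_doc)
```

The point of the design is that pattern matching happens once for the corpus, and the per-document loop only consumes positions that were already found.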

chainsawriot · 2023-11-21T17:39:29Z

Damn.

quanteda.proximity/R/get_dist.R

Line 12 in bf530b4

poss <- seq_along(as.character(tokenized_text))

chainsawriot · 2023-11-21T18:23:32Z

051869f is 3x slower

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  94.7ms  107ms      7.65     159MB     44.0

Created on 2023-11-21 with reprex v2.0.2

chainsawriot · 2023-11-21T18:41:12Z

789d1fb is 2x slower

Given that this introduces more functionality (phrase matching etc.), I think it should be enough, although further optimization is certainly possible.

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  53.1ms 63.2ms      13.1    98.9MB     56.0
bench::mark(quanteda::index(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "quanteda::index(toks, c(\"a\"))"   5.22ms 5.59ms      175.    2.22MB     13.3

Created on 2023-11-21 with reprex v2.0.2
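For reference, the median timings from the three runs in this thread can be compared with a few lines of base R. The numbers are copied from the `bench::mark()` output above; the variable names are only illustrative, and the comparison is between runs on this one machine, not against the #26 baseline.

```r
# Median timings in milliseconds, copied from the bench::mark() output
# in this thread (names are the commits the runs were reported for).
median_ms <- c(bf530b4 = 428, `051869f` = 107, `789d1fb` = 63.2)

# Median of quanteda::index() alone, from the last run
index_ms <- 5.59

# Speed-up of each commit relative to the first measurement
speedup <- median_ms[["bf530b4"]] / median_ms

# Share of the final tokens_proximity() time that index() alone would
# account for, if it were called once with the same arguments
index_share <- index_ms / median_ms[["789d1fb"]]

print(round(speedup, 2))
print(round(index_share, 3))
```

By this rough measure the indexing call itself is a small fraction of the remaining run time, which is consistent with the remark that further optimization is possible.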

Use quanteda::index() ref #38

bf530b4

Indexing only once

051869f

More optz

789d1fb

chainsawriot added 4 commits November 21, 2023 19:58

Code clean up [no ci]

7c5b9c6

Add case_insensitive and make phrase work

df6d67d

Add tests; also add a custom field

cee8d66

That custom field might be useful for `dfm()`

Explicit return value

6521689

style guide...

chainsawriot marked this pull request as ready for review November 21, 2023 20:58

chainsawriot merged commit 6272667 into v0.0 Nov 21, 2023

chainsawriot deleted the pattern branch November 21, 2023 20:59

chainsawriot mentioned this pull request Nov 21, 2023

Optimization #20

Open
