This repository was archived by the owner on Feb 11, 2024. It is now read-only.

Conversation

@chainsawriot
Contributor

No description provided.

@chainsawriot
Contributor Author

chainsawriot commented Nov 21, 2023

This is 20x slower than #26, as recorded in #20 by @schochastics.

Several possibilities to look into:

  • Do we need to build an index for every document?
  • Where are the bottlenecks?
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"   398ms  428ms      2.34     167MB     18.7

Created on 2023-11-21 with reprex v2.0.2
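One way to approach the bottleneck question is R's built-in sampling profiler. The sketch below is only an illustration: `convert_step()`, `index_step()` and `process_docs()` are hypothetical stand-ins, not functions from quanteda.proximity, and the toy documents are random integers rather than real tokens.

```r
# Hypothetical stand-ins for per-document work (NOT the package's code).
convert_step <- function(doc) as.character(doc)  # assumed costly conversion
index_step <- function(doc) which(doc == 1L)     # assumed cheap lookup
process_docs <- function(docs) {
  lapply(docs, function(doc) {
    convert_step(doc)
    index_step(doc)
  })
}

# Toy corpus: 200 "documents" of 2000 integer token IDs each.
set.seed(1)
docs <- replicate(200, sample.int(5000L, 2000L, replace = TRUE),
                  simplify = FALSE)

# Sample the call stack every 10 ms while the work runs.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
for (i in seq_len(20)) res <- process_docs(docs)
Rprof(NULL)

# Functions ranked by time spent in their own code.
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

The same pattern should carry over to the real call by replacing the loop body with `tokens_proximity(toks, c("a"))`.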

@chainsawriot
Contributor Author

Damn.

poss <- seq_along(as.character(tokenized_text))
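A minimal sketch of why that line looks wasteful, assuming `tokenized_text` is a plain vector for a single document (an assumption; the real object may differ). For a plain vector, converting to character does not change its length, so the positions can be generated without the conversion. This would not hold for a whole quanteda `tokens` object, where `as.character()` flattens all documents into one vector.

```r
# Toy stand-in for one tokenized document (assumption: a plain integer
# vector of token IDs, not an actual quanteda tokens object).
tokenized_text <- c(3L, 1L, 4L, 1L, 5L)

# Positions via the conversion, as in the line quoted above.
pos_converted <- seq_along(as.character(tokenized_text))

# Positions without the conversion.
pos_direct <- seq_along(tokenized_text)

# Same positions either way, so the as.character() call is avoidable here.
identical(pos_converted, pos_direct)
```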

@chainsawriot
Contributor Author

chainsawriot commented Nov 21, 2023

051869f is 3x slower

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  94.7ms  107ms      7.65     159MB     44.0

Created on 2023-11-21 with reprex v2.0.2

@chainsawriot
Contributor Author

chainsawriot commented Nov 21, 2023

789d1fb is 2x slower.

Given that this introduces more functionality (phrases, etc.), I think this should be enough, although further optimization is certainly possible.

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  53.1ms 63.2ms      13.1    98.9MB     56.0
bench::mark(quanteda::index(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "quanteda::index(toks, c(\"a\"))"   5.22ms 5.59ms      175.    2.22MB     13.3

Created on 2023-11-21 with reprex v2.0.2
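For a rough sense of the relative gains, the median timings reported in this thread can be compared directly. This is plain arithmetic on the numbers above; the variable names are only for illustration, and it says nothing about the #26 baseline, whose timing is not reproduced here.

```r
# Median timings (ms) copied from the bench::mark() outputs in this thread.
medians_ms <- c(first_run = 428, `051869f` = 107, `789d1fb` = 63.2,
                index = 5.59)

# Speedup of each later commit relative to the first run in this thread.
speedup_vs_first <- medians_ms[["first_run"]] /
  medians_ms[c("051869f", "789d1fb")]
round(speedup_vs_first, 2)

# How much slower tokens_proximity() at 789d1fb is than quanteda::index().
overhead_vs_index <- medians_ms[["789d1fb"]] / medians_ms[["index"]]
round(overhead_vs_index, 1)
```

So within this thread the median drops by roughly 4x and then 6.8x relative to the first run, and `tokens_proximity()` remains about 11x slower than a bare `quanteda::index()` call on the same pattern.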

@chainsawriot chainsawriot marked this pull request as ready for review November 21, 2023 20:58
@chainsawriot chainsawriot merged commit 6272667 into v0.0 Nov 21, 2023
@chainsawriot chainsawriot deleted the pattern branch November 21, 2023 20:59
@chainsawriot chainsawriot mentioned this pull request Nov 21, 2023

2 participants