This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

Use quanteda::index() ref #38 #44

Merged
merged 7 commits into from
Nov 21, 2023

Conversation

chainsawriot
Contributor

No description provided.

@chainsawriot
Contributor Author

chainsawriot commented Nov 21, 2023

This is about 20x slower than #26, going by the benchmark recorded in #20 by @schochastics.

Several possibilities

  • Do we need to make index for every document?
  • Where are the bottlenecks?
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"   398ms  428ms      2.34     167MB     18.7

Created on 2023-11-21 with reprex v2.0.2
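
On the first question above, here is a minimal sketch, assuming `quanteda::index()` can be called once on the whole `tokens` object instead of once per document. It returns one row per match with `docname`, `from`, `to` and `pattern` columns, so per-document positions can be recovered by splitting on `docname`. The toy documents and the names `toks_small` and `pos_by_doc` are illustrative only and do not come from the package.

```r
library(quanteda)

# Three toy documents; d2 contains no match for the pattern.
toks_small <- tokens(c(d1 = "a b a c", d2 = "x y z", d3 = "b a"))

# One index() call covers every document at once.
idx <- index(toks_small, pattern = "a")

# Group the match start positions by document name.
pos_by_doc <- split(idx$from, idx$docname)
pos_by_doc[["d1"]]
```

Documents without a match contribute no rows to `idx`, so downstream code would need to handle them separately.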

@chainsawriot
Contributor Author

Damn.

poss <- seq_along(as.character(tokenized_text))
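
The comment does not say why this line is a problem. One plausible reading is that `as.character()` materializes every token as a string only to count positions. A sketch of a cheaper equivalent, assuming `tokenized_text` holds a single document (the object `doc` below is a stand-in, not the package's variable):

```r
library(quanteda)

# Stand-in for one tokenized document.
doc <- tokens("a b a c")

# Original approach: convert tokens to strings, then enumerate positions.
poss_chr <- seq_along(as.character(doc))

# Alternative: ask for the token count directly, with no string conversion.
poss_len <- seq_len(ntoken(doc))

identical(poss_chr, poss_len)
```

Whether this matters in practice would need to be confirmed with a benchmark on the real corpus.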

@chainsawriot
Contributor Author

chainsawriot commented Nov 21, 2023

051869f is 3x slower

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  94.7ms  107ms      7.65     159MB     44.0

Created on 2023-11-21 with reprex v2.0.2

@chainsawriot
Contributor Author

chainsawriot commented Nov 21, 2023

789d1fb is 2x slower.

Given that this introduces more functionality (phrases etc.), I think this should be enough, although further optimization is certainly possible.

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  53.1ms 63.2ms      13.1    98.9MB     56.0
bench::mark(quanteda::index(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "quanteda::index(toks, c(\"a\"))"   5.22ms 5.59ms      175.    2.22MB     13.3

Created on 2023-11-21 with reprex v2.0.2
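
In the benchmark above, `quanteda::index()` has a median of 5.59ms against 63.2ms for `tokens_proximity()`, so most of the remaining time is spent outside the index lookup. As a rough illustration of the kind of step that follows the lookup, here is a vectorized base R sketch that turns match positions into a distance for every token position. The helper `nearest_distance()` is hypothetical and is not the package's implementation; it reports 0 at a match, which may differ from the convention `tokens_proximity()` uses.

```r
# Hypothetical helper: distance from each position 1..n to the nearest hit.
nearest_distance <- function(n, hits) {
  # No match in this document: distance is undefined.
  if (length(hits) == 0L) return(rep(NA_integer_, n))
  hits <- sort(unique(hits))
  pos <- seq_len(n)
  # For each position, the index of the last hit at or before it (0 if none).
  left <- findInterval(pos, hits)
  # Distance to the nearest hit on the left, Inf where there is none.
  d_left <- ifelse(left > 0L, pos - hits[pmax(left, 1L)], Inf)
  # Distance to the nearest hit on the right, Inf where there is none.
  d_right <- ifelse(left < length(hits),
                    hits[pmin(left + 1L, length(hits))] - pos, Inf)
  as.integer(pmin(d_left, d_right))
}

nearest_distance(6, c(2, 5))
#> [1] 1 0 1 1 0 1
```

The `hits` argument could be the `from` column of `quanteda::index()` for one document, with `n` taken from `ntoken()`.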

@chainsawriot chainsawriot marked this pull request as ready for review November 21, 2023 20:58
@chainsawriot chainsawriot merged commit 6272667 into v0.0 Nov 21, 2023
@chainsawriot chainsawriot deleted the pattern branch November 21, 2023 20:59
@chainsawriot chainsawriot mentioned this pull request Nov 21, 2023