Skip to content
This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

Consider avoiding new tokens child class #35

Open
kbenoit opened this issue Nov 17, 2023 · 6 comments
Open

Consider avoiding new tokens child class #35

kbenoit opened this issue Nov 17, 2023 · 6 comments

Comments

@kbenoit
Copy link

kbenoit commented Nov 17, 2023

Right now, there is a single pattern for each proximity computation, stored as the docvar proximity that is a list of distances. This has some drawbacks.

  1. It allows a single proximity distance for each document, with a hardwired pattern - even when that pattern is not fixed and therefore could match multiple elements. This would be similar to implementing kwic() in this way, rather than generating a new object with a different structure.

  2. Having proximities as list elements for a tokens type means these can become wrong if the tokens are modified in any way, through tokens_wordstem/tolower/remove() etc. (A solution like kwic() to generate a new object would avoid this.)

@chainsawriot
Copy link
Contributor

chainsawriot commented Nov 17, 2023

@kbenoit Thank you for the suggestion. This is also the reason for #33. I think your suggestion for making tokens_with_proximity a class of its own (not a sister of tokens) is really good.

  • Make tokens_with_proximity a class of its own

@koheiw
Copy link

koheiw commented Nov 18, 2023

I like the way you saved additional information in a list column, but it is true that changes in the token object breaks it. It worth considering token-level meta field official, but you should make it impossible to apply tokens_select/remove for now by giving a class like tokens_proximity or something.

@chainsawriot
Copy link
Contributor

chainsawriot commented Nov 20, 2023

  • Make sure the unique tokens_with_proximity do not work with tokens_select() and friends
  • Provide tokens.tokens_with_proximity() to convert an object back to tokens for further manipulation

@chainsawriot
Copy link
Contributor

chainsawriot commented Nov 20, 2023

  • Add a tolower option to tokens_proximity and record it in metadata (default to TRUE?). So that when processing the hardcoded dfm(tolower), we don't need to tolower again. ref Make it 100% compatible with quanteda #27

@chainsawriot
Copy link
Contributor

chainsawriot commented Nov 20, 2023

  • docvars.tokens_with_proximity methods
  • meta.tokens_with_proximity methods

@chainsawriot
Copy link
Contributor

chainsawriot commented Nov 20, 2023

  • tokens_proximity() works with tokens_with_proxmity object

chainsawriot added a commit that referenced this issue Nov 20, 2023
* Make the class unique and add several methods ref #35

* Make tokens_proxitmity() still work for changing keywords [no ci]

* Update Doc [no ci]
chainsawriot added a commit that referenced this issue Nov 20, 2023
chainsawriot added a commit that referenced this issue Nov 20, 2023
* Make `tolower` default for tokens_proximity()

* Update README
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants