This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

tokenvars(x, "proximity") #53

Open
chainsawriot opened this issue Nov 24, 2023 · 0 comments

Comments

@chainsawriot
Contributor

chainsawriot commented Nov 24, 2023

The implementation is hacky because it puts a list-column in the docvars data frame, and that list-column is actually storing token-level data, whereas docvars is meant for document-level variables.

quanteda/spacyr actually has the same issue (quanteda/spacyr#77): having as.tokens.spacyr_parsed(x, include_pos = TRUE) generate tokens such as "great/ADJ" is, IMO, also hacky.
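To illustrate the hack, a minimal base-R sketch of the kind of concatenation involved (the token and POS values here are made up for illustration): the token-level metadata is fused into the token string itself, so downstream code must split the string back apart to recover it.

```r
# Illustrative sketch: fusing token-level metadata (POS tags) into the
# token strings, as in "great/ADJ". Not the spacyr internals, just the idea.
tokens <- c("a", "great", "movie")
pos    <- c("DET", "ADJ", "NOUN")
tagged <- paste(tokens, pos, sep = "/")
tagged
#> [1] "a/DET"      "great/ADJ"  "movie/NOUN"

# Recovering the metadata requires parsing the token string back apart:
recovered_pos <- sub(".*/", "", tagged)
```

The fragility is visible in the last line: any token that itself contains a "/" would break the round trip.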

ropensci/tif states (emphasis added):

tokens (data frame) - A valid data frame tokens object is a data frame with at least two columns. There must be a column called doc_id that is a character vector with UTF-8 encoding. Document ids must be unique. There must also be a column called token that must also be a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. Additional token-level metadata columns are allowed but not required.

tokens (list) - A valid corpus tokens object is a (possibly named) list of character vectors. The character vectors, as well as names, should be in UTF-8 encoding. No other attributes should be present in either the list or any of its elements.
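The two tif-valid shapes described above can be sketched in base R as follows (the documents and tags are invented for illustration). The data-frame form has an obvious home for token-level metadata; the list form does not:

```r
# Data-frame tokens object: one row per token; extra metadata columns
# (here, a hypothetical "pos" column) are explicitly allowed by tif.
tokens_df <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("a", "great", "movie"),
  pos    = c("DET", "ADJ", "NOUN"),  # token-level metadata column
  stringsAsFactors = FALSE
)

# List tokens object: a named list of character vectors with no other
# attributes -- so there is nowhere to attach token-level metadata.
tokens_list <- list(
  d1 = c("a", "great"),
  d2 = "movie"
)
```

The contrast makes the problem in this issue concrete: the list form is what quanteda uses, and it has no slot for per-token columns.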

quanteda's tokens object takes the list approach, and thus carries no token-level metadata. Is there a better way to store token-level metadata in the current tokens object?
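One conceivable direction (purely a sketch; the `tokenvars` structure below is a hypothetical illustration, not an existing quanteda or tokenvars API) is to keep token-level metadata in a parallel, per-document structure aligned by position with the tokens list, rather than hiding it in docvars:

```r
# Hypothetical sketch: token-level metadata kept parallel to the tokens
# list, one metadata data frame per document, aligned row-for-token.
tokens_list <- list(
  d1 = c("a", "great"),
  d2 = "movie"
)
tokvars <- list(
  d1 = data.frame(pos = c("DET", "ADJ"), stringsAsFactors = FALSE),
  d2 = data.frame(pos = "NOUN", stringsAsFactors = FALSE)
)

# The invariant such a design would have to maintain: one metadata row
# per token in each document.
aligned <- all(lengths(tokens_list) == vapply(tokvars, nrow, integer(1)))
```

The obvious cost is that every tokens-modifying operation (subsetting, removing stopwords, etc.) would have to update the parallel structure to preserve the alignment invariant.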

chainsawriot added a commit to gesistsa/tokenvars that referenced this issue Nov 24, 2023