This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

tokenvars(x, "proximity") #53

Open
chainsawriot opened this issue Nov 24, 2023 · 0 comments

Comments

@chainsawriot
Contributor

chainsawriot commented Nov 24, 2023

The implementation is hacky because it puts a list-column in the docvars data frame, and that list-column is actually storing token-level data, whereas docvars is meant for document-level variables.

quanteda/spacyr actually has the same issue (quanteda/spacyr#77): having as.tokens.spacyr_parsed(x, include_pos = TRUE) generate tokens such as "great/ADJ" is, IMO, also hacky.
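To illustrate the hack, a minimal base-R sketch of the kind of concatenation involved (the token and POS values here are made up for illustration): the token-level metadata is fused into the token string itself, so downstream code must split the string back apart to recover it.

```r
# Illustrative sketch: fusing token-level metadata (POS tags) into the
# token strings, as in "great/ADJ". Not the spacyr internals, just the idea.
tokens <- c("a", "great", "movie")
pos    <- c("DET", "ADJ", "NOUN")
tagged <- paste(tokens, pos, sep = "/")
tagged
#> [1] "a/DET"      "great/ADJ"  "movie/NOUN"

# Recovering the metadata requires parsing the token string back apart:
recovered_pos <- sub(".*/", "", tagged)
```

The fragility is visible in the last line: any token that itself contains a "/" would break the round trip.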

ropensci/tif states (emphasis added):

tokens (data frame) - A valid data frame tokens object is a data frame with at least two columns. There must be a column called doc_id that is a character vector with UTF-8 encoding. Document ids must be unique. There must also be a column called token that must also be a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. Additional token-level metadata columns are allowed but not required.

tokens (list) - A valid corpus tokens object is a (possibly named) list of character vectors. The character vectors, as well as names, should be in UTF-8 encoding. No other attributes should be present in either the list or any of its elements.
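The two tif-valid shapes described above can be sketched in base R as follows (the documents and tags are invented for illustration). The data-frame form has an obvious home for token-level metadata; the list form does not:

```r
# Data-frame tokens object: one row per token; extra metadata columns
# (here, a hypothetical "pos" column) are explicitly allowed by tif.
tokens_df <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("a", "great", "movie"),
  pos    = c("DET", "ADJ", "NOUN"),  # token-level metadata column
  stringsAsFactors = FALSE
)

# List tokens object: a named list of character vectors with no other
# attributes -- so there is nowhere to attach token-level metadata.
tokens_list <- list(
  d1 = c("a", "great"),
  d2 = "movie"
)
```

The contrast makes the problem in this issue concrete: the list form is what quanteda uses, and it has no slot for per-token columns.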

quanteda's tokens object takes the list approach, and thus carries no token-level metadata. Is there a better way to store token-level metadata in the current tokens object?
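One conceivable direction (purely a sketch; the `tokenvars` structure below is a hypothetical illustration, not an existing quanteda or tokenvars API) is to keep token-level metadata in a parallel, per-document structure aligned by position with the tokens list, rather than hiding it in docvars:

```r
# Hypothetical sketch: token-level metadata kept parallel to the tokens
# list, one metadata data frame per document, aligned row-for-token.
tokens_list <- list(
  d1 = c("a", "great"),
  d2 = "movie"
)
tokvars <- list(
  d1 = data.frame(pos = c("DET", "ADJ"), stringsAsFactors = FALSE),
  d2 = data.frame(pos = "NOUN", stringsAsFactors = FALSE)
)

# The invariant such a design would have to maintain: one metadata row
# per token in each document.
aligned <- all(lengths(tokens_list) == vapply(tokvars, nrow, integer(1)))
```

The obvious cost is that every tokens-modifying operation (subsetting, removing stopwords, etc.) would have to update the parallel structure to preserve the alignment invariant.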

chainsawriot added a commit to gesistsa/tokenvars that referenced this issue Nov 24, 2023