This repository has been archived by the owner on Feb 11, 2024. It is now read-only.
The implementation, i.e. putting a list-column in the docvars data frame, is hacky because the list-column is actually storing token-level data, while docvars are meant to hold document-level data.
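To make the mismatch concrete, here is a minimal base R sketch. It does not use quanteda at all: `docvars_df`, `toks` and the POS tags are hypothetical stand-ins, chosen only to show what a token-level list-column inside a document-level data frame looks like.

```r
# Hypothetical stand-in for a docvars data frame: one row per document
docvars_df <- data.frame(doc_id = c("d1", "d2"), stringsAsFactors = FALSE)

# Hypothetical list-column: one vector of POS tags per document,
# i.e. token-level data stored in a document-level table
docvars_df$pos <- list(c("DET", "ADJ", "NOUN"), c("ADJ", "NOUN"))

# The tokens themselves live in a separate object
toks <- list(d1 = c("a", "great", "film"), d2 = c("great", "cast"))

# Nothing ties the two together: dropping a token leaves the tags stale
toks$d1 <- toks$d1[toks$d1 != "a"]

lengths(toks)            # tokens per document after the removal
lengths(docvars_df$pos)  # tags per document, unchanged
```

After the removal, the first document has fewer tokens than tags, which is the kind of silent misalignment that makes the list-column approach feel fragile.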
quanteda/spacyr actually has the same issue (quanteda/spacyr#77): as.tokens.spacyr_parsed(x, include_pos = TRUE) generating something like "great/ADJ" as a token is IMO also hacky.
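A rough base R illustration of why the concatenated form is awkward. This only mimics the shape of the output ("great/ADJ"); it is not spacyr's implementation, and the vectors `tok` and `pos` are made up for the example.

```r
# Made-up tokens and POS tags for a single document
tok <- c("a", "great", "film")
pos <- c("DET", "ADJ", "NOUN")

# Fuse the tag into the token string, as in "great/ADJ"
tagged <- paste(tok, pos, sep = "/")

# The plain token is no longer present as a token
"great" %in% tagged

# Getting the parts back means parsing strings; splitting on the last "/"
tok_back <- sub("/[^/]*$", "", tagged)
pos_back <- sub("^.*/", "", tagged)
```

Every downstream step (matching, counting, building a dfm) now sees "great/ADJ" as the type, so the metadata is only recoverable by string manipulation.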
ropensci/tif states (emphasis added):
tokens (data frame) - A valid data frame tokens object is a data frame with at least two columns. There must be a column called doc_id that is a character vector with UTF-8 encoding. Document ids must be unique. There must also be a column called token that must also be a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. Addition token-level metadata columns are allowed but not required.
tokens (list) - A valid corpus tokens object is (possibly named) list of character vectors. The character vectors, as well as names, should be in UTF-8 encoding. No other attributes should be present in either the list or any of its elements.
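The two forms quoted above can be sketched in base R as follows. This is only an illustration under assumed data: the `pos` column and all object names are hypothetical, and the tif package's own validators are not used.

```r
# Data frame form: one row per token, extra token-level columns allowed
tokens_df <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("a", "great", "film", "great", "cast"),
  pos    = c("DET", "ADJ", "NOUN", "ADJ", "NOUN"),  # hypothetical metadata
  stringsAsFactors = FALSE
)

# List form: a named list of character vectors and nothing else
tokens_list <- split(tokens_df$token, tokens_df$doc_id)

# Going back to the data frame form recovers doc_id and token only
roundtrip <- data.frame(
  doc_id = rep(names(tokens_list), lengths(tokens_list)),
  token  = unlist(tokens_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

names(roundtrip)  # the pos column has nowhere to live in the list form
```

The round trip keeps the tokens and their document ids but drops `pos`, which is the limitation of the list approach raised in this issue.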
quanteda's tokens object takes the list approach and thus has no place for token-level metadata. Is there a better way to store token-level metadata in the current tokens object?