Download random sample of documents #1435
lukavdplas started this conversation in Ideas
Users often don't have the rights to download an entire corpus. Also, for a preliminary analysis or for educational purposes, working with the full dataset may not be desirable.
However, the default sorting of corpora is not random. I think it is based on indexing order. In some cases this can be treated as random, in that the factors that determine indexing order are unrelated to anything you would be analysing, but it usually introduces a bias of some kind.
I think it would be nice if users could download a random sample of documents.
This requires some randomisation, which can be achieved in one of the following ways:
In addition, you could add the option to limit the download size.
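To make the idea concrete, here is a minimal sketch of one possible approach: reservoir sampling over a stream of documents, with the sample size acting as the download limit. This is only an illustration, not how the application currently works. The function name `sample_documents` and its parameters are hypothetical, and the sketch assumes the documents can be iterated over once in whatever order the index returns them.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_documents(
    documents: Iterable[T],
    sample_size: int,
    seed: Optional[int] = None,
) -> List[T]:
    """Return a uniform random sample of at most `sample_size` documents.

    Hypothetical helper: uses reservoir sampling (Algorithm R), so the
    stream is read once and only `sample_size` documents are held in memory.
    """
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")

    # A seeded generator makes the sample reproducible.
    rng = random.Random(seed)
    reservoir: List[T] = []

    for index, document in enumerate(documents):
        if index < sample_size:
            # Fill the reservoir with the first `sample_size` documents.
            reservoir.append(document)
        else:
            # Replace a random slot with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = document

    return reservoir


# Example: sample 10 document ids from a stand-in corpus of 1000.
sample = sample_documents(range(1000), sample_size=10, seed=42)
```

Because each document ends up in the sample with equal probability regardless of its position in the stream, the indexing order no longer affects which documents are selected. Passing a seed would let a user reproduce the same sample later, assuming the underlying order of the stream has not changed.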