Allow buffered input streams #23
Labels
- acceptance: go ahead (Reviewed, implementation can start)
- area: performance (Performance improvements)
- help wanted (External contributions welcome)
- type: feature (New feature or request)
V0ldek added the `type: feature`, `help wanted`, and `acceptance: go ahead` labels on Sep 20, 2022.
V0ldek added a commit that referenced this issue on May 12, 2023:
A more abstract API for accessing the underlying byte stream, replacing the engines' reliance on direct `&[u8]` slice access, to allow adding buffered input streams (#23) in the future. Two types were added, `OwnedBytes` and `BorrowedBytes`, to support the current easy scenario of having the bytes already in memory. Ref: #23
V0ldek added a commit that referenced this issue on Jun 14, 2023:
- Added `MmapInput`, which maps a file into memory on Unix and Windows.
- The CLI app now automatically decides which input to use, favoring mmap in most cases. This can be overridden with `--force-input`.
Ref: #23
github-actions bot removed the `acceptance: go ahead` label on Jun 14, 2023.
This is not closed yet. We need a smarter buffered input for when mmap is unavailable, one that does not hold the entire input in memory.
Tagging @V0ldek for notifications.
Is your feature request related to a problem? Please describe.
The current implementation reads the entire input into a string. This is not production-viable: the very large files that we are targeting with all the performance improvements might not fit in memory. A first step would be to enable buffered reading, loading a single page worth of input at a time. There are challenges here: a single logical query step can span arbitrarily many blocks, e.g. JSON labels can be arbitrarily long.
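To make the challenge concrete, here is a minimal sketch of block-at-a-time reading from any `std::io::Read` source. The names `for_each_block` and `BLOCK_SIZE` are hypothetical and not part of the project; the block size is deliberately tiny so that a label visibly straddles several blocks.

```rust
use std::io::{self, Read};

/// Tiny block size so the effect is visible; an assumption for this sketch.
const BLOCK_SIZE: usize = 8;

/// Reads the source one block at a time and hands each block to `on_block`.
/// Only one block is held in memory at any moment.
fn for_each_block<R: Read, F: FnMut(&[u8])>(mut source: R, mut on_block: F) -> io::Result<()> {
    let mut block = [0u8; BLOCK_SIZE];
    loop {
        // Fill the block, tolerating short reads from the source.
        let mut filled = 0;
        while filled < BLOCK_SIZE {
            match source.read(&mut block[filled..]) {
                Ok(0) => break, // end of input
                Ok(n) => filled += n,
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
        if filled == 0 {
            return Ok(());
        }
        on_block(&block[..filled]);
        if filled < BLOCK_SIZE {
            return Ok(()); // a partial block is always the last one
        }
    }
}
```

Feeding it `{"a_very_long_label":1}` (23 bytes) yields three blocks, and no single block contains the whole label, which is exactly why the engines cannot simply compare labels within the current block.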
Describe the solution you'd like
First of all, current implementations rely heavily on raw `AlignedSlice` data. This should be abstracted behind a buffered input that can yield slices on demand.
Second, the query engines need to be made aware of this. They currently rely on having all the data available to index into the slice and compare labels. The engines also need to communicate to the classifiers at which point it is safe to stop keeping old input blocks in memory: we always need the entire label before the currently examined colon to be buffered, but once we have examined it, it can be discarded.
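One possible shape for such an input, as a rough sketch only: it reads blocks lazily, serves slices by absolute offset, and lets the engine signal when older blocks are no longer needed. `BufferedInput`, `slice`, `discard_before`, `retained`, and `BLOCK_SIZE` are hypothetical names for illustration, not the project's API, and the `Vec`-based storage is chosen for brevity rather than performance.

```rust
use std::io::{self, Read};

/// Tiny block size for illustration; an assumption for this sketch.
const BLOCK_SIZE: usize = 8;

struct BufferedInput<R> {
    source: R,
    /// Bytes currently retained; `buffer[0]` sits at absolute offset `start`.
    buffer: Vec<u8>,
    start: usize,
    eof: bool,
}

impl<R: Read> BufferedInput<R> {
    fn new(source: R) -> Self {
        Self { source, buffer: Vec::new(), start: 0, eof: false }
    }

    /// Appends up to one more block from the source to the buffer.
    fn read_block(&mut self) -> io::Result<()> {
        let mut block = [0u8; BLOCK_SIZE];
        let n = loop {
            match self.source.read(&mut block) {
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                other => break other?,
            }
        };
        if n == 0 {
            self.eof = true;
        }
        self.buffer.extend_from_slice(&block[..n]);
        Ok(())
    }

    /// Returns the bytes in `from..to`, reading more blocks on demand.
    /// Yields `None` if the range was already discarded or runs past the end.
    fn slice(&mut self, from: usize, to: usize) -> io::Result<Option<&[u8]>> {
        if from < self.start || to < from {
            return Ok(None);
        }
        while !self.eof && self.start + self.buffer.len() < to {
            self.read_block()?;
        }
        if self.start + self.buffer.len() < to {
            return Ok(None);
        }
        Ok(Some(&self.buffer[from - self.start..to - self.start]))
    }

    /// Called by the engine once everything before `offset` has been examined.
    /// Drops whole blocks only, so the retained data stays block-aligned.
    fn discard_before(&mut self, offset: usize) {
        let boundary = offset - offset % BLOCK_SIZE;
        if boundary > self.start {
            let dropped = (boundary - self.start).min(self.buffer.len());
            self.buffer.drain(..dropped);
            self.start += dropped;
        }
    }

    /// Number of bytes currently held in memory.
    fn retained(&self) -> usize {
        self.buffer.len()
    }
}
```

In this sketch an engine would request the label bytes preceding a colon with `slice`, compare them, and then call `discard_before` with the colon's offset. Memory use is then bounded by the longest label plus a block, rather than by the size of the whole input.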