Allow buffered input streams #23

V0ldek · 2022-09-20T10:32:29Z

Is your feature request related to a problem? Please describe.
Current implementation reads the entire input to a string. This is not production-viable – very large files that we are targeting with all the performance improvements might not fit in memory. A first step would be to enable buffered reading – load a single page worth of input at a time. There are challenges here – it is possible for a single logical query step to span arbitrarily many blocks, e.g. JSON labels can be arbitrarily long.

Describe the solution you'd like
First of all, current implementations heavily rely on raw AlignedSlice data. This should be abstracted behind a buffered input that can yield slices on-demand.

Two, the query engines need to be made aware of this. They currently rely on having all the data available to index into the slice and compare labels. The engines also need to communicate to the classifiers at which point it is safe to stop keeping old input blocks in memory – we always need the entire label before the currently looked-at colon to be buffered, but after we examine it, it can be discarded.

The text was updated successfully, but these errors were encountered:

A more abstract API to access the underlying byte stream replacing the reliance of the engines on a direct `&[u8]` slice access, to allow adding buffered input streams (#23) in the future. Two types were added, `OwnedBytes` and `BorrowedBytes`, to support the current easy scenario of having the bytes already in memory. Ref: #23

- Added `MmapInput` which maps a file into memory on unix and windows. - The CLI app now automatically decides which input to use, favoring mmap in most cases. This can be overriden with `--force-input`. Ref: #23

V0ldek · 2023-06-14T10:30:36Z

This is not closed yet. We need a smarter buffered input for when Mmap is unavailable, one that does not hold the entire input in memory.

github-actions · 2023-06-14T10:30:50Z

Tagging @V0ldek for notifications

V0ldek added type: feature New feature or request help wanted External contributions welcome acceptance: go ahead Reviewed, implementation can start labels Sep 20, 2022

V0ldek self-assigned this Sep 20, 2022

V0ldek removed their assignment Nov 13, 2022

V0ldek added this to the v1.0.0 milestone Nov 16, 2022

V0ldek added area: engine area: performance Performance improvements labels Nov 23, 2022

V0ldek self-assigned this Apr 9, 2023

V0ldek mentioned this issue May 11, 2023

V0ldek/input streams #133

Merged

3 tasks

V0ldek mentioned this issue Jun 13, 2023

V0ldek/input streams #149

Merged

3 tasks

V0ldek closed this as completed in #149 Jun 14, 2023

github-actions bot removed the acceptance: go ahead Reviewed, implementation can start label Jun 14, 2023

V0ldek reopened this Jun 14, 2023

V0ldek added the acceptance: go ahead Reviewed, implementation can start label Jun 14, 2023

github-actions bot added the acceptance: triage Waiting for owner's input label Jun 14, 2023

V0ldek removed the acceptance: triage Waiting for owner's input label Jun 28, 2023

V0ldek removed mod: classifier labels Oct 4, 2023

V0ldek modified the milestones: v1.0.0, v1.1.0 Nov 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow buffered input streams #23

Allow buffered input streams #23

V0ldek commented Sep 20, 2022

V0ldek commented Jun 14, 2023

github-actions bot commented Jun 14, 2023

Allow buffered input streams #23

Allow buffered input streams #23

Comments

V0ldek commented Sep 20, 2022

V0ldek commented Jun 14, 2023

github-actions bot commented Jun 14, 2023