feat(pull): filter by scanning all rows #351

youen · 2025-02-19T22:08:05Z

This patch introduces a --scann flag that changes how a pull filters data. It is particularly useful when you need to extract a significant portion of your database: instead of filtering during extraction, this mode first pulls all the rows and then applies the filter as a post-processing step.

This can be faster than querying the database once per row with filters applied, especially on large datasets. It avoids repeated database queries by retrieving all the data in a single pass and excluding unwanted rows afterward.
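
To illustrate the idea, here is a minimal sketch of the scan-then-filter step. It is not the code from this patch: `Row`, `matches`, and `scanFilter` are hypothetical names, and it assumes a filter is a set of column/value pairs that a row must all carry.

```go
package main

import (
	"fmt"
	"reflect"
)

// Row is a hypothetical stand-in for one extracted record.
type Row map[string]any

// matches reports whether row carries every column/value pair of filter.
func matches(row, filter Row) bool {
	for column, want := range filter {
		got, ok := row[column]
		if !ok || !reflect.DeepEqual(got, want) {
			return false
		}
	}
	return true
}

// scanFilter keeps, in their original order, the rows that match
// at least one filter. All rows are assumed to be in memory already.
func scanFilter(rows []Row, filters []Row) []Row {
	kept := make([]Row, 0, len(rows))
	for _, row := range rows {
		for _, filter := range filters {
			if matches(row, filter) {
				kept = append(kept, row)
				break // one matching filter is enough
			}
		}
	}
	return kept
}

func main() {
	rows := []Row{
		{"id": 1, "city": "Paris"},
		{"id": 2, "city": "Lyon"},
		{"id": 3, "city": "Paris"},
	}
	filters := []Row{{"city": "Paris"}}

	for _, row := range scanFilter(rows, filters) {
		fmt.Println(row["id"], row["city"])
	}
}
```

The source table is read once and the cost of filtering moves to the client, which is the trade-off described above.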

Changes Overview:

CLI Update:
- A new --scann flag is added to the pull command to filter in memory.
- The command now handles both kinds of filtering: filters read from files, and in-memory filtering with the --scann flag.

Handler Changes:
- The pull handler now passes the scann option, which determines how the filters are applied.

Driver Interface Update:
- The Pull method's signature now accepts an additional parameter, the included KeyStore, to support filtering in memory.

Test Additions:
- Several tests cover the --scann functionality: filtering with files, applying filters, handling no matches, and keeping row order consistent.

Summary of Key Modifications:

CLI: The command now checks for the --scann flag and uses it to load data into memory for filtering instead of using a database filter.
Puller Logic: Both puller and pullerParallel now handle the new included KeyStore to filter data in memory when scann is enabled.
Tests: New tests are introduced to ensure the functionality of filtering with the --scann flag, including cases where there are no matches or multiple rows with specific filters.
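
As a rough picture of what filtering with an included KeyStore could look like, here is a sketch under stated assumptions. The `KeyStore` type, its `Add` and `Has` methods, and `keepIncluded` are all hypothetical and written for illustration only; the real types and signatures in the patch may differ.

```go
package main

import "fmt"

// KeyStore is a hypothetical set of accepted key values.
// The actual type in the patch may look different.
type KeyStore struct {
	keys map[string]struct{}
}

// NewKeyStore returns an empty store.
func NewKeyStore() *KeyStore {
	return &KeyStore{keys: make(map[string]struct{})}
}

// Add marks a key value as included. Keys are normalized with fmt.Sprint,
// a simplification that treats 1 and "1" as the same key.
func (s *KeyStore) Add(key any) {
	s.keys[fmt.Sprint(key)] = struct{}{}
}

// Has reports whether a key value was marked as included.
func (s *KeyStore) Has(key any) bool {
	_, ok := s.keys[fmt.Sprint(key)]
	return ok
}

// keepIncluded returns, in input order, the rows whose keyColumn
// value is present in the store.
func keepIncluded(rows []map[string]any, keyColumn string, included *KeyStore) []map[string]any {
	kept := make([]map[string]any, 0, len(rows))
	for _, row := range rows {
		if value, ok := row[keyColumn]; ok && included.Has(value) {
			kept = append(kept, row)
		}
	}
	return kept
}

func main() {
	included := NewKeyStore()
	included.Add(2)

	rows := []map[string]any{{"id": 1}, {"id": 2}, {"id": 3}}
	fmt.Println(len(keepIncluded(rows, "id", included)))
}
```

A set lookup per row keeps the in-memory pass linear in the number of scanned rows, whatever the number of filter values.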

Example use of the `--scann` flag:

lino pull source --scann --filter-from-file customer_filter.jsonl

The flag ensures that the entire dataset is pulled and then filtered in memory based on the provided customer_filter.jsonl file.

adrienaury · 2025-02-21T09:10:41Z

Thank you @youen :)

There are failing tests:

pull-one-value-scann-mode FAIL
pull-greater-than-filter-with-json-logs FAIL

Why --scann and not --scan for the flag name?

adrienaury

Still failing tests

feat(pull): filter by scanning all rows

448a2fe

youen requested a review from adrienaury February 19, 2025 22:08

fix(pull): bad error test

f04cda2

fix(pull): rename scann to scan

382ea7c

adrienaury requested changes Mar 10, 2025
