feat(pull): filter by scanning all rows #351

Open
wants to merge 3 commits into
base: main

Conversation

youen
Collaborator

@youen youen commented Feb 19, 2025

This patch introduces a --scann flag that modifies the behavior of a pull. This is particularly useful when you need to extract a significant portion of your database. Instead of filtering data during the extraction process, this mode allows you to first pull all the data and then apply the filter as a post-processing step.

This method can be faster than querying the database for each individual row with filters applied, especially when dealing with large datasets. It minimizes the need for multiple database queries and speeds up the extraction process by retrieving all the data at once, then excluding unwanted rows afterward.

Changes Overview:

  1. CLI Update:

    • A new scann flag (--scann) is added to the pull command for filtering in memory.
    • The command is updated to handle both types of filtering: using filters from files and using the scann flag to filter in memory.
  2. Handler Changes:

    • The handler for pulling data is updated to pass the scann option, which influences how the filters are applied.
  3. Driver Interface Update:

    • The Pull method's signature is updated to accept an additional parameter, included KeyStore, which supports filtering in memory.
  4. Test Additions:

    • Several tests are added for the --scann functionality, including tests for filtering with files, applying filters, handling no matches, and ensuring order consistency.

Summary of Key Modifications:

  • CLI: The command now checks for the --scann flag and uses it to load data into memory for filtering instead of using a database filter.
  • Puller Logic: Both puller and pullerParallel now accept the new included KeyStore parameter and use it to filter data in memory when scann is enabled.
  • Tests: New tests are introduced to ensure the functionality of filtering with the --scann flag, including cases where there are no matches or multiple rows with specific filters.
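
To make the in-memory filtering step concrete, here is a minimal sketch in Go. It is not the code from this patch: the `Row` type, the shape of `KeyStore`, and the helpers `NewKeyStore`, `Add`, `Has` and `FilterRows` are all hypothetical names chosen for illustration. It assumes that filters match on equality over a fixed set of columns and that every row has already been read by the scan.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Row is a hypothetical stand-in for a pulled row (column name -> value).
type Row map[string]any

// KeyStore is a hypothetical in-memory set of filter keys built over a
// fixed list of columns. It is an assumption, not the type from the patch.
type KeyStore struct {
	columns []string
	keys    map[string]struct{}
}

// NewKeyStore creates an empty store keyed on the given columns.
func NewKeyStore(columns []string) *KeyStore {
	sorted := append([]string(nil), columns...)
	sort.Strings(sorted) // stable column order gives a stable key
	return &KeyStore{columns: sorted, keys: map[string]struct{}{}}
}

// keyOf builds a string key from the row's values for the store's columns.
func (s *KeyStore) keyOf(row Row) string {
	parts := make([]string, len(s.columns))
	for i, c := range s.columns {
		parts[i] = fmt.Sprintf("%v", row[c])
	}
	return strings.Join(parts, "\x00")
}

// Add registers a filter row as an allowed key.
func (s *KeyStore) Add(filter Row) { s.keys[s.keyOf(filter)] = struct{}{} }

// Has reports whether the row matches one of the registered filters.
func (s *KeyStore) Has(row Row) bool {
	_, ok := s.keys[s.keyOf(row)]
	return ok
}

// FilterRows keeps only the rows present in the store, in scan order.
func FilterRows(rows []Row, included *KeyStore) []Row {
	kept := make([]Row, 0, len(rows))
	for _, r := range rows {
		if included.Has(r) {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	included := NewKeyStore([]string{"customer_id"})
	included.Add(Row{"customer_id": 2})
	included.Add(Row{"customer_id": 3})

	// Pretend these rows came from a single full-table scan.
	scanned := []Row{
		{"customer_id": 1, "name": "alice"},
		{"customer_id": 2, "name": "bob"},
		{"customer_id": 3, "name": "carol"},
	}
	for _, r := range FilterRows(scanned, included) {
		fmt.Println(r["customer_id"], r["name"])
	}
}
```

With a set lookup per row, the cost is one pass over the table plus constant-time membership checks, which is the trade-off described above: more rows read, far fewer queries issued.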

Example of the --scann Flag Use:

lino pull source --scann --filter-from-file customer_filter.jsonl

The flag ensures that the entire dataset is pulled and then filtered in memory based on the provided customer_filter.jsonl file.

@youen youen requested a review from adrienaury February 19, 2025 22:08
@adrienaury
Member

Thank you @youen :)

There are failing tests:

pull-one-value-scann-mode FAIL
pull-greater-than-filter-with-json-logs FAIL

Why --scann and not --scan for the flag name?

Member

@adrienaury adrienaury left a comment

Still failing tests

2 participants