Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: Rewrite using SQLite3 #1242

Draft
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Jun 17, 2023

Description of proposed changes

A work in progress - see commit messages

Related issue(s)

Related to #866

Testing

  • Create a nextstrain/base image from this branch: nextstrain/docker-base@5d4fad9
  • Verify that --query still works for existing workflows.
  • Verify that --query-sqlite works with SQLite query syntax.
    • I'm going to run a ncov rebuild using this. Will post results in a thread.

Checklist

  • Delete the temporary DB file (implemented)
  • Add a --debug option to show database file size breakdown
  • Address disk space usage concerns (ref)
  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@victorlin victorlin self-assigned this Jun 17, 2023
@victorlin victorlin mentioned this pull request Jun 17, 2023
1 task
@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from 799fe02 to 27afe62 Compare June 17, 2023 05:46
@victorlin victorlin marked this pull request as draft June 20, 2023 17:19
@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from 27afe62 to f9d5c56 Compare June 20, 2023 22:15
@tsibley tsibley self-requested a review June 21, 2023 22:37
@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from f9d5c56 to 5a9e0cc Compare June 23, 2023 22:23
Base automatically changed from victorlin/filter-rewrite-prep to master June 23, 2023 23:16
@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from 5a9e0cc to 28754e3 Compare June 27, 2023 18:31
@codecov
Copy link

codecov bot commented Jun 27, 2023

Codecov Report

Attention: 80 lines in your changes are missing coverage. Please review.

Comparison is base (e2ca468) 67.67% compared to head (ee1a62b) 70.26%.
Report is 1 commits behind head on master.

❗ Current head ee1a62b differs from pull request most recent head 6b4b9f1. Consider uploading reports for the commit 6b4b9f1 to get more accurate results

Files Patch % Lines
augur/io/tabular_file.py 66.66% 10 Missing and 9 partials ⚠️
augur/filter/io.py 89.02% 16 Missing and 2 partials ⚠️
augur/filter/include_exclude_rules.py 90.74% 8 Missing and 7 partials ⚠️
augur/filter/debug.py 56.25% 6 Missing and 1 partial ⚠️
augur/filter/dates.py 94.28% 6 Missing ⚠️
augur/io/sqlite3.py 93.33% 6 Missing ⚠️
augur/filter/_run.py 86.95% 1 Missing and 2 partials ⚠️
augur/dates/__init__.py 77.77% 1 Missing and 1 partial ⚠️
augur/filter/report.py 96.77% 1 Missing and 1 partial ⚠️
augur/io/metadata.py 81.81% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1242      +/-   ##
==========================================
+ Coverage   67.67%   70.26%   +2.59%     
==========================================
  Files          69       71       +2     
  Lines        7518     7440      -78     
  Branches     1844     1775      -69     
==========================================
+ Hits         5088     5228     +140     
+ Misses       2158     1895     -263     
- Partials      272      317      +45     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch 5 times, most recently from 64f9325 to ee1a62b Compare July 7, 2023 23:57

if args.output_strains:
output_strains.close()
import_metadata(args.metadata, args.metadata_id_columns, args.metadata_delimiters)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One significant change in this rewrite is that the metadata is no longer processed in chunks. This has implementation benefits, but comes with the downside of substantially increased disk usage.

Previously, there was the issue of augur filter using too much memory, which prompted the current chunked implementation. Now, I'm worried about augur filter using too much disk space, especially for large datasets such as ncov/open/metadata.tsv.zst, which results in a 12GB database file based on a trial run just now. This should be addressed before merging.

Two ideas to address this:

  1. Make improvements to minimize the amount of size of the database file. The easy win here is to read a subset of the metadata columns, since most usages of augur filter only use a few columns for computation, while the remainder is only used for final output. I started on this in bc2a030. The difficult part will be determining which columns are used by --query-pandas/--query-sqlite.
  2. Warn users if they don't have enough disk space. This requires an estimation of how large the database file will be, which isn't trivial. I started on this in b411c58.

I think (1) is worthwhile, and might remove the need for (2) if disk usage is reduced significantly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally concur with you here. Some ideas for implementing the disk usage reduction mitigations of (1)…

For determining which columns are used by --query-sqlite, two possibilities come to mind:

  1. Parse the SQL query into tokens/an AST and inspect that. Two examples of where I did this previously using the sqlparse Python package: checking usage of transactions and checking usage of SET search_path statements. While both split a SQL script into a list of statements, they take two tacks from there: the first walks the tokens in each statement, the second matches a regex against each statement. We'd probably want the first tack here.
  2. Use SQLite's introspection capabilities to ask what the query will use/do (e.g. with EXPLAIN or other methods). I've done this manually before in other database systems (with structured EXPLAIN output), but not programmatically in SQLite. It may not be possible and/or feasible, but seems worth looking into esp. if (1) isn't workable for some reason.

For determining what's used by --query-pandas, I'd think to either:

  1. Leverage Pandas' internals here (if reasonably possible), as I believe you've prototyped elsewhere…
  2. Implement a minimal parser for the Pandas query grammar, e.g. using pyparsing. This doesn't have to support the whole grammar, just what's sufficient to extract column names. In a similar vein, I took this approach for Markdown modification in Nextstrain CLI.
  3. Rig up a pandas.DataFrame with monkeypatching, subclassing, or a proxy object that monitors and inspects column access and then apply the query to it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsibley thanks for the suggestions! As you mentioned elsewhere, Python's ast library seems to work well for pandas queries. I've successfully implemented (1) loading a subset of columns in #1294. If that gets merged, I'll rework this PR around those changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates:

  • For --query-sqlite, I've started implementation using sqlparse in 3cf5670.
  • For --query-pandas using ast, that was recently added in 0c5a137, released, broke some workflows, and now we are trying to leverage Pandas internals (again, but hopefully better).

@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from ee1a62b to 60dadc9 Compare February 3, 2024 00:52
@victorlin victorlin changed the base branch from master to victorlin/improved-filter-io February 3, 2024 00:52
@victorlin
Copy link
Member Author

victorlin commented Feb 3, 2024

Rebased this PR onto #1294 since it builds on those changes.

I am also becoming less confident about supporting a SQLite-backed --query so I've gathered related changes into 3cf5670. The existing --query is still supported by loading metadata from the database into DataFrame chunks.

@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch 4 times, most recently from 8d4b9e2 to 2c13fcf Compare February 7, 2024 00:10
@victorlin victorlin force-pushed the victorlin/improved-filter-io branch from 0924b31 to 1d52bc5 Compare February 7, 2024 00:10
@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from 2c13fcf to 710c000 Compare February 7, 2024 01:25
@victorlin victorlin force-pushed the victorlin/improved-filter-io branch 2 times, most recently from 5d1c045 to b56f699 Compare February 8, 2024 02:14
Base automatically changed from victorlin/improved-filter-io to master February 8, 2024 02:57
@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from 710c000 to 6b4b9f1 Compare February 13, 2024 01:29
@victorlin victorlin mentioned this pull request Feb 16, 2024
This will be used in future commits by a new implementation of augur
filter.
These will be used in future commits by a new implementation of augur
filter.
This serves multiple purposes:

1. Detect duplicates.
2. Speed up strain-based queries.
3. Provide a marker for a single primary index column (akin to a pandas
   DataFrame index column).

Also adds DuplicateError, a new exception class to expose duplicates for
custom error messages.

This will be used in a future commit by a new implementation of augur
filter.
This is unused but can come in handy when debugging a query.
…alue()

This will be used in a future commit by a new implementation of augur
filter.
… constants.py

These variables are temporary to bridge the transition from using shared
variables to a shared database. They will be removed in a future commit.
WARNING: This change won't work as-is. Broken out as a separate commit
for easier review.

Remove metadata input and examples in include_exclude_rules. These are
not applicable in the SQLite-based implementation.
WARNING: This change won't work as-is. Broken out as a separate commit
for easier review.

Remove sequence index parameter in include_exclude_rules. These are not
applicable in the SQLite-based implementation. It also removes the need
to check for DataFrame instances in kwargs.
This is the final change that replaces the Pandas-based implementation
of augur filter with a SQLite3-based implementation.

Breaking changes in behavior (see changes under tests/):

1. `--subsample-seed` is still deterministic but differs from previous
   implementation.
2. `--include*`: Previously, both exclusion and force-inclusion would be
   shown in console output and `--output-log`. Now, only the
   force-inclusion is reported.

Implementation differences with no functional changes:

1. Tabular files are loaded into tables in a temporary SQLite3 database
   file on disk rather than Pandas DataFrames. This generally means less
   memory usage and more disk usage. Tables are indexed on strain.
2. Since chunked loading of metadata was introduced to avoid high memory
   usage¹, that is no longer necessary and all operations are now on the
   entire metadata (except for `--query`/`--query-pandas`).
3. For large datasets, the SQLite3 implementation is much faster than
   the Pandas implementation.
4. Instead of relying on continually updated variables (e.g.
   `valid_strains`), new tables in the database are created at various
   stages in the process. The "filter reason table" is used as a source
   of truth for all outputs (and is conveniently a ~direct
   representation of `--output-log`). This also allows the function
   `augur.filter._run.run()` to be further broken into smaller parts.
5. Exclusion/inclusion is done using WHERE expressions.
6. For subsampling, priority queues are no longer necessary, as the the
   highest priority strains can be determined using a ORDER BY across
   all strains.
7. Date parsing has improved with caching and a a min/max approach to
   resolving date ranges.

Note that sequence I/O remains unchanged.

¹ 87ca73c
The re-implementation of augur filter accounts for this directly in
filter_by_ambiguous_date().

Remove old tests and add equivalent tests. Note that some date strings
have been changed since the previous values would fail date format
validation.
Now that disk space usage is uncapped (compared to previous pandas
implementation using in-memory chunks), it can be useful to know how
large the database file was.

However, I'll have to think more about how useful this is once database
files are passed in by the user. Ideas:

- Mark this as an experimental feature for `augur filter`, to be changed
  or removed with any version.
- Add it to the `augur db` interface, e.g. output of `augur db inspect`.
  However, it can still be useful to know the sizes of "intermediate"
  tables.

It'd also be useful to add runtime information here. Ideas:

- Print run times of each "major" function in real-time. This can
  probably be achieved by some sort of decorator function.
This adds a new flag to query the SQLite database natively.
`--query`/`--query-pandas` will still behave as expected.

All Pandas-based query functions are renamed to be Pandas-specific.

To avoid breaking changes, alias `--query` to `--query-pandas`.
@victorlin victorlin force-pushed the victorlin/filter-rewrite-using-sqlite branch from 6b4b9f1 to 99c5a0d Compare September 9, 2024 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants