Feature: Support indexing columns and refreshing data by time in the cache layer #173
Conversation
…or cache layer feature

- Refactor to create `CacheLayerRefresher` to handle refreshing via the scheduler, and make `CacheLayerLoader` focus on the export-and-load-to-cache-layer-storage logic through its `load` method.
- Make `CacheLayerRefresher` run immediately by default when adding the cache-loading job.
- Refactor the logger info in the DuckDB data source.
- Update the `CacheLayerLoader` test cases to focus on testing export and load.
- Add `CacheLayerRefresher` test cases covering the non-refreshed options and the refresh-by-time options.
- Add test cases for setting the refresh-by-time options in the integration-testing package.
- Support closing the `CacheLayerRefresher` scheduler when the server is closed by Ctrl + C or termination signals.
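The last bullet (closing the scheduler on Ctrl + C or termination signals) could be sketched roughly as below; `refresher` and its `stop()` method are assumed names for illustration, not the project's actual API:

```javascript
// Hypothetical sketch: stop the cache-layer refresher's scheduler when the
// server receives a shutdown signal. "refresher" and "stop()" are assumed
// names for illustration, not the actual refresher API.
function installShutdownHooks(refresher) {
  const shutdown = () => {
    // Stop the underlying scheduler so no new refresh jobs are started.
    refresher.stop();
    process.exit(0);
  };
  // Ctrl + C sends SIGINT; process managers usually send SIGTERM.
  ['SIGINT', 'SIGTERM'].forEach((signal) => process.on(signal, shutdown));
}

module.exports = { installShutdownHooks };
```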
Codecov Report

@@           Coverage Diff            @@
##           develop     #173      +/-   ##
===========================================
- Coverage    91.13%   91.12%   -0.01%
===========================================
  Files          314      316       +2
  Lines         5006     5091      +85
  Branches       668      679      +11
===========================================
+ Hits          4562     4639      +77
- Misses         316      321       +5
- Partials      128      131       +3
Aside from a few questions, the rest LGTM.
.circleci/config.yml (outdated)

    - run: yarn nx affected --target=test --ci --coverage --coverageReporters=lcov ${AFFECTED_ARGS}
    - run:
        # extend the timeout to 30m to prevent circleci default timeout (10m) reached and stop the long testing.
        no_output_timeout: 30m
Is this timeout setting still necessary?
Thanks for pointing this out. I think we can remove it now, because this PR doesn't seem to exceed the timeout even after I added the new test cases.
    metadata: {
      'cache.vulcan.com': {
        isUsedTag: true,

    it.each([
Thanks for refactoring this using Jest's native method!
    ) {
      Object.entries(indexes).forEach(([index, column]) => {
        // duckdb could not create duplicate index, so here should drop index if exists and create again.
        connection.run(`DROP INDEX IF EXISTS ${index} ON ${schema}.${tableName}`);
Could you tell me the purpose of recreating an index with the same name on the same table?
Thanks for reviewing and asking. The reason is that we will refresh and call the importing method again, but DuckDB cannot create duplicate index names.
Although we could query DuckDB for the index and check whether it exists, that takes more steps:
- query the table
- check the result for existence
- if it doesn't exist, create the index

So, as the simpler way, we drop and recreate directly.
PS: I discovered my DROP INDEX IF EXISTS ${index} ON ${schema}.${tableName}
syntax is incorrect; DuckDB can't drop an index on a specific table. https://duckdb.org/docs/sql/indexes#drop-index
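A minimal sketch of the drop-and-recreate step with the corrected DuckDB syntax (DuckDB's DROP INDEX takes no ON clause); the helper name and the schema/table values in the example are illustrative, not the actual data-source code:

```javascript
// Build the drop-and-recreate statements for each configured index.
// DuckDB rejects duplicate index names, so drop first, then create.
// Note: unlike some databases, DuckDB's DROP INDEX has no "ON table" clause.
function buildIndexStatements(schema, tableName, indexes) {
  return Object.entries(indexes).flatMap(([index, column]) => [
    `DROP INDEX IF EXISTS ${index}`,
    `CREATE INDEX ${index} ON ${schema}.${tableName} (${column})`,
  ]);
}

// Example (illustrative names):
// buildIndexStatements('vulcan', 'orders', { idx_user_id: 'user_id' })
// → ['DROP INDEX IF EXISTS idx_user_id',
//    'CREATE INDEX idx_user_id ON vulcan.orders (user_id)']
```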
Thanks to @onlyjackfrost for the feedback. The worry is that dropping the index could cost a lot of time when the table has thousands upon thousands of rows (it seems to be an issue in OLTP databases, though I'm not sure about OLAP databases). We could arrange performance testing in the future, and if it turns out to be an issue, we will switch back to the checking steps above.
…n schema for cache layer feature

- Add validation in the `CheckCache` middleware of the build package: index names must be unique and must match the index name format pattern.
- Update the test cases for checking indexes in the `CheckCache` middleware.
- Support creating indexes in the DuckDB cache layer at the DuckDB data source.
- Add test cases for the created indexes in the DuckDB data source.
- Support checking that index names are not duplicated across schema files in the `CacheLayerRefresher`.
- Add test cases to `CacheLayerRefresher`.
- Add `.eslintignore` to prevent linting the sub `node_modules` in the monorepo package, here the `extension-driver-snowflake`.
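The cross-schema uniqueness check described above could look roughly like this; the schema shape and the function name are assumptions for illustration, not the actual `CacheLayerRefresher` code:

```javascript
// Collect index names that appear more than once across schemas' cache
// settings. The { cache: [{ indexes: {...} }] } shape is an assumption
// made for this sketch.
function findDuplicateIndexNames(schemas) {
  const seen = new Set();
  const duplicates = new Set();
  for (const schema of schemas) {
    for (const cache of schema.cache ?? []) {
      for (const index of Object.keys(cache.indexes ?? {})) {
        if (seen.has(index)) duplicates.add(index);
        seen.add(index);
      }
    }
  }
  return [...duplicates];
}
```

The refresher could then fail fast (before scheduling any job) whenever this returns a non-empty list.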
Thanks for answering, LGTM
Description

What's the problem you're trying to solve

In #150, in order to enhance query performance after users send an API request to run our data endpoint and get results from the data source, we need to provide a caching (pre-loading) layer with DuckDB.

Based on #154, after defining the cache layer loader to support loading Parquet files into in-memory DuckDB, we also need to enhance two parts:

- Support `refreshTime` or `refreshExpression` options in cache settings (Note: this issue only supports the `refreshTime` option).
- Support checking whether users set the `indexes` options in cache settings or not.

Issue ticket number

closes #164
Additional Context

- The index name in `duckdb` should be unique, so we add validation to the `checkCache` middleware to check it within the same schema; we also check the index naming pattern (the same as for the table name).
- Create a refresher (`CacheLayerRefresher`) that uses the `toad-scheduler` package to handle the refresh time setting. If the cache refreshes every 5 seconds but the task takes 6 seconds, then when the next refresh is triggered and the previous task is still running, the scheduler skips that run and waits for the next 5-second tick to check again.
- When testing the `CacheLayerRefresher`, because our task is a promise task, we create a `flushPromise` helper function to flush all pending promises in the event loop; please refer to https://gist.github.com/apieceofbart/e6dea8d884d29cf88cdb54ef14ddbcc4
- Add `.eslintignore` to prevent linting the sub `node_modules` in the monorepo package, here the `extension-driver-snowflake`; please refer to eslint/eslint#14544 ("Error: Failed to load config "@ljharb" to extend from").
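The skip-when-still-running behavior described in the `CacheLayerRefresher` bullet can be illustrated with a small simulation (this is only a sketch of the behavior, not toad-scheduler's implementation; all names are illustrative):

```javascript
// Simulate interval ticks where a tick is skipped while the previous task
// is still running, matching the "refresh every 5s, task takes 6s" example.
function simulateTicks(tickSeconds, taskSeconds, totalSeconds) {
  const runsStartedAt = [];
  let busyUntil = -1;
  for (let t = tickSeconds; t <= totalSeconds; t += tickSeconds) {
    if (t < busyUntil) continue; // previous refresh still in progress: skip
    runsStartedAt.push(t);
    busyUntil = t + taskSeconds;
  }
  return runsStartedAt;
}

// With a 5s interval and a 6s task over 20s, the tick at 10s is skipped and
// the next refresh starts at 15s:
simulateTicks(5, 6, 20); // → [5, 15]
```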