Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Support indexing column and refreshing data by time the in the cache layer #173

Merged
merged 2 commits into from
May 29, 2023

Conversation

kokokuo
Copy link
Contributor

@kokokuo kokokuo commented May 24, 2023

Description

What’s the problem you're trying to solve

In #150, In order to enhance our query performance after users send the API request to run our data endpoint to get the result from the data source. We need to provide a Caching (pre-loading) Layer with the duckDB to enhance query performance.

Based on #154, after defining the cache layer loader to support loading parquet files to in-memory duckDB, we also need to enhance two parts:

  1. support the cache layer data refresh by our API schema that defined refrehTime or refreshExpressionoptions in cache settings ( Note: In this issue only support refreshTime options ).
  2. support building column index when loading the parquet files data to create cache table according to user set indexes options in cache settings or not.
# API schema YAML
cache:
  # "order" table keep in duckDB
  - cacheTableName: "order"
      sql: "select * from order",
      # optional
      profile: pg
      # optional
      refreshTime:
        every: "5m"	
      # optional
      indexes:
        'idx_a': 'col_a'
        'idx_b': 'col_b'
  
# used data source
profiles: pg

Issue ticket number

closes #164

Additional Context

  1. The index of the duckdb should be unique, so add the validation to the checkCache middleware to check it in the same schema, otherwise, check the index naming pattern ( same as the table name ).
  2. Based on the above, add the validation to check the index name should not be duplicated by more than one API schema ( in the CacheLayerRefresher )
  3. We use the toad-scheduler package to handle the refresh time setting, if the cache will refresh by 5 seconds, but the task need to handle 6 seconds, then when next time the refresh schedule is triggered and discover the task is still working, the scheduler will skip and wait for next 5 seconds to check again, like below sample:
scheduler triggered time: 13:00, 13:05, 13:10, 13:15:
- 13:00 > handle task and 13:06 finished the task, so the scheduler will skip the 13:05 trigger event
- 13:10 > handle task and 13:16 finished the task, so the scheduler will skip the 13:15 trigger event
  1. We use the jest faker timer to test CacheLayerRefresher, because our task is a promise task, so we create a flushPromise helper function to flush all pending promises in the event loop, please refer: https://gist.github.com/apieceofbart/e6dea8d884d29cf88cdb54ef14ddbcc4
  2. We add .eslintignore to prevent linting the sub node_modules in the mono repo package, here indicating the extension-driver-snowflake, please refer Error: Failed to load config "@ljharb" to extend from. eslint/eslint#14544
  3. Extend the timeout to 30m to prevent CircleCI default timeout (10m) reached and stop the long testing and cause failed. ( Our tests will spent more time follow by ourt test cases increased )

@vercel
Copy link

vercel bot commented May 24, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
vulcan-sql-document ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 26, 2023 8:12am

@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 81cf156 to 364421c Compare May 24, 2023 12:58
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 364421c to 27c75b6 Compare May 24, 2023 13:00
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 27c75b6 to beae905 Compare May 24, 2023 13:25
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from beae905 to 9f5a571 Compare May 24, 2023 13:37
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 9f5a571 to dfb35b3 Compare May 24, 2023 14:09
…or cache layer feature

- refactor to create "CacheLayerRefresher" to handle the refresh by scheduler, and make the "CacheLayerLoader" focus on export and load to cache layer storage logistic by "load" method
- make "CacheLayerRefresher" run immediately default when add the job of loading cache.
- refactor the logger info in duckdb data source.
- update the test case of cache layer loader to focus on testing the export and load
- add the cache layer refresher test cases to test non-refreshed options and refreshed by time options
- add the test cases of setting the refresh by time options in integration-testing package.
- support closing the scheduler of cache layer refresher when server close by Ctrl + C or terminated signals
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from dfb35b3 to 092cb26 Compare May 24, 2023 14:10
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 092cb26 to c553809 Compare May 24, 2023 14:12
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from c553809 to 636560c Compare May 24, 2023 14:31
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 636560c to ef8a8e4 Compare May 24, 2023 14:54
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from ef8a8e4 to 69eeb78 Compare May 25, 2023 01:39
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 69eeb78 to c37233c Compare May 25, 2023 02:51
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from c37233c to 38545f8 Compare May 25, 2023 03:14
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 38545f8 to 4c2b8a9 Compare May 25, 2023 03:18
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 4c2b8a9 to fbd6374 Compare May 25, 2023 03:20
@codecov-commenter
Copy link

codecov-commenter commented May 25, 2023

Codecov Report

Patch coverage: 88.69% and project coverage change: -0.01 ⚠️

Comparison is base (885720c) 91.13% compared to head (88ea1a7) 91.12%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #173      +/-   ##
===========================================
- Coverage    91.13%   91.12%   -0.01%     
===========================================
  Files          314      316       +2     
  Lines         5006     5091      +85     
  Branches       668      679      +11     
===========================================
+ Hits          4562     4639      +77     
- Misses         316      321       +5     
- Partials       128      131       +3     
Flag Coverage Δ
build 90.50% <100.00%> (+0.09%) ⬆️
catalog-server 100.00% <ø> (ø)
cli 81.48% <ø> (ø)
core 94.18% <89.36%> (-0.18%) ⬇️
extension-dbt 97.43% <ø> (ø)
extension-debug-tools 98.11% <ø> (ø)
extension-driver-bq 85.41% <ø> (ø)
extension-driver-duckdb 96.61% <100.00%> (+0.14%) ⬆️
extension-driver-pg 96.11% <ø> (ø)
extension-driver-snowflake 96.26% <ø> (ø)
integration-testing 90.27% <ø> (ø)
serve 86.99% <57.14%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
packages/core/src/containers/types.ts 100.00% <ø> (ø)
packages/core/src/models/extensions/dataSource.ts 82.35% <ø> (ø)
packages/serve/src/lib/server.ts 66.66% <57.14%> (+2.25%) ⬆️
packages/core/test/cache-layer/mockDataSource.ts 81.57% <81.57%> (ø)
...es/core/src/lib/cache-layer/cacheLayerRefresher.ts 92.30% <92.30%> (ø)
...ild/src/lib/schema-parser/middleware/checkCache.ts 92.00% <100.00%> (+0.82%) ⬆️
packages/core/src/containers/modules/cacheLayer.ts 100.00% <100.00%> (ø)
...kages/core/src/lib/cache-layer/cacheLayerLoader.ts 100.00% <100.00%> (+6.06%) ⬆️
packages/core/src/lib/cache-layer/index.ts 100.00% <100.00%> (ø)
...xtension-driver-duckdb/src/lib/duckdbDataSource.ts 95.69% <100.00%> (+0.24%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@onlyjackfrost onlyjackfrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides some questions, others LGTM.

- run: yarn nx affected --target=test --ci --coverage --coverageReporters=lcov ${AFFECTED_ARGS}
- run:
# extend the timeout to 30m to prevent circleci default timeout (10m) reached and stop the long testing.
no_output_timeout: 30m
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this timeout setting still necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, I think we could remove it now because this PR seems not to exceed the timeout when I add new test cases

metadata: {
'cache.vulcan.com': {
isUsedTag: true,
it.each([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring this using Jest native method!

) {
Object.entries(indexes).forEach(([index, column]) => {
// duckdb could not create duplicate index, so here should drop index if exists and create again.
connection.run(`DROP INDEX IF EXISTS ${index} ON ${schema}.${tableName}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you tell me the purpose of recreating an index with the same name on the same table?

Copy link
Contributor Author

@kokokuo kokokuo May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing and asking, the reason is that we will refresh and call the importing method again, but the duckdb could create duplicate index names.

Although we could query whether the index from duckdb and check it exists. But it may do more steps:

  1. query the table
  2. get the result and do check if exist
  3. If not exist, create the index

So, for the simple way, we could drop and recreate directly.

PS: I discovered my DROP INDEX IF EXISTS ${index} ON ${schema}.${tableName} syntax is inccoret, the duckdb couldn't drop the index on the specific table. https://duckdb.org/docs/sql/indexes#drop-index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for @onlyjackfrost feedback the worried cost a lot of time when dropping the index if the table has thousands of thousands ( seems it's an issue in OLTP database, but not sure for OLAP database), we could arrange the performance testing in the future and if it's a issue, we will change back to the above checking steps.

@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 39ddd79 to bd18efa Compare May 26, 2023 08:10
…n schema for cache layer feature

- add validation for checking index name should be unique and index name format pattern in "CheckCache" middleware of the build package.
- update the test cases for checking index in the "CheckCache" middleware.
- support creating index in the duckdb cache layer at duckdb data source.
- add test cases to test created index in the duckdb data source.
- support checking index name should not duplicate more than on schema files in the "CacheLayerRefresher".
- add test cases to "CacheLayerRefresher".
- add .eslintignore to prevent liniting the sub node_modules in the monorepo package, here indicating the extension-driver-snowflake.
@kokokuo kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from bd18efa to 1d1d578 Compare May 26, 2023 08:12
Copy link
Contributor

@onlyjackfrost onlyjackfrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answering, LGTM

@kokokuo kokokuo merged commit 2b5e282 into develop May 29, 2023
@hanshino hanshino deleted the feature/indexing-refreshing-cache-layer branch January 31, 2024 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support indexing column and refreshing data by time the in the cache layer
3 participants