Feature: Support indexing column and refreshing data by time the in the cache layer #173

kokokuo · 2023-05-24T12:51:37Z

Description

What’s the problem you're trying to solve

In #150, In order to enhance our query performance after users send the API request to run our data endpoint to get the result from the data source. We need to provide a Caching (pre-loading) Layer with the duckDB to enhance query performance.

Based on #154, after defining the cache layer loader to support loading parquet files to in-memory duckDB, we also need to enhance two parts:

support the cache layer data refresh by our API schema that defined refrehTime or refreshExpressionoptions in cache settings ( Note: In this issue only support refreshTime options ).
support building column index when loading the parquet files data to create cache table according to user set indexes options in cache settings or not.

# API schema YAML
cache:
  # "order" table keep in duckDB
  - cacheTableName: "order"
      sql: "select * from order",
      # optional
      profile: pg
      # optional
      refreshTime:
        every: "5m"	
      # optional
      indexes:
        'idx_a': 'col_a'
        'idx_b': 'col_b'
  
# used data source
profiles: pg

Issue ticket number

closes #164

Additional Context

The index of the duckdb should be unique, so add the validation to the checkCache middleware to check it in the same schema, otherwise, check the index naming pattern ( same as the table name ).
Based on the above, add the validation to check the index name should not be duplicated by more than one API schema ( in the CacheLayerRefresher )
We use the toad-scheduler package to handle the refresh time setting, if the cache will refresh by 5 seconds, but the task need to handle 6 seconds, then when next time the refresh schedule is triggered and discover the task is still working, the scheduler will skip and wait for next 5 seconds to check again, like below sample:

scheduler triggered time: 13:00, 13:05, 13:10, 13:15:
- 13:00 > handle task and 13:06 finished the task, so the scheduler will skip the 13:05 trigger event
- 13:10 > handle task and 13:16 finished the task, so the scheduler will skip the 13:15 trigger event

We use the jest faker timer to test CacheLayerRefresher, because our task is a promise task, so we create a flushPromise helper function to flush all pending promises in the event loop, please refer: https://gist.github.com/apieceofbart/e6dea8d884d29cf88cdb54ef14ddbcc4
We add .eslintignore to prevent linting the sub node_modules in the mono repo package, here indicating the extension-driver-snowflake, please refer Error: Failed to load config "@ljharb" to extend from. eslint/eslint#14544
Extend the timeout to 30m to prevent CircleCI default timeout (10m) reached and stop the long testing and cause failed. ( Our tests will spent more time follow by ourt test cases increased )

vercel · 2023-05-24T12:51:41Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
vulcan-sql-document	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 26, 2023 8:12am

…or cache layer feature - refactor to create "CacheLayerRefresher" to handle the refresh by scheduler, and make the "CacheLayerLoader" focus on export and load to cache layer storage logistic by "load" method - make "CacheLayerRefresher" run immediately default when add the job of loading cache. - refactor the logger info in duckdb data source. - update the test case of cache layer loader to focus on testing the export and load - add the cache layer refresher test cases to test non-refreshed options and refreshed by time options - add the test cases of setting the refresh by time options in integration-testing package. - support closing the scheduler of cache layer refresher when server close by Ctrl + C or terminated signals

codecov-commenter · 2023-05-25T03:28:36Z

Codecov Report

Patch coverage: 88.69% and project coverage change: -0.01 ⚠️

Comparison is base (885720c) 91.13% compared to head (88ea1a7) 91.12%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #173      +/-   ##
===========================================
- Coverage    91.13%   91.12%   -0.01%     
===========================================
  Files          314      316       +2     
  Lines         5006     5091      +85     
  Branches       668      679      +11     
===========================================
+ Hits          4562     4639      +77     
- Misses         316      321       +5     
- Partials       128      131       +3

Flag	Coverage Δ
build	`90.50% <100.00%> (+0.09%)`	⬆️
catalog-server	`100.00% <ø> (ø)`
cli	`81.48% <ø> (ø)`
core	`94.18% <89.36%> (-0.18%)`	⬇️
extension-dbt	`97.43% <ø> (ø)`
extension-debug-tools	`98.11% <ø> (ø)`
extension-driver-bq	`85.41% <ø> (ø)`
extension-driver-duckdb	`96.61% <100.00%> (+0.14%)`	⬆️
extension-driver-pg	`96.11% <ø> (ø)`
extension-driver-snowflake	`96.26% <ø> (ø)`
integration-testing	`90.27% <ø> (ø)`
serve	`86.99% <57.14%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
packages/core/src/containers/types.ts	`100.00% <ø> (ø)`
packages/core/src/models/extensions/dataSource.ts	`82.35% <ø> (ø)`
packages/serve/src/lib/server.ts	`66.66% <57.14%> (+2.25%)`	⬆️
packages/core/test/cache-layer/mockDataSource.ts	`81.57% <81.57%> (ø)`
...es/core/src/lib/cache-layer/cacheLayerRefresher.ts	`92.30% <92.30%> (ø)`
...ild/src/lib/schema-parser/middleware/checkCache.ts	`92.00% <100.00%> (+0.82%)`	⬆️
packages/core/src/containers/modules/cacheLayer.ts	`100.00% <100.00%> (ø)`
...kages/core/src/lib/cache-layer/cacheLayerLoader.ts	`100.00% <100.00%> (+6.06%)`	⬆️
packages/core/src/lib/cache-layer/index.ts	`100.00% <100.00%> (ø)`
...xtension-driver-duckdb/src/lib/duckdbDataSource.ts	`95.69% <100.00%> (+0.24%)`	⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

onlyjackfrost

Besides some questions, others LGTM.

onlyjackfrost · 2023-05-26T02:54:22Z

.circleci/config.yml

-      - run: yarn nx affected --target=test --ci --coverage --coverageReporters=lcov ${AFFECTED_ARGS}
+      - run:
+          # extend the timeout to 30m to prevent circleci default timeout (10m) reached and stop the long testing.
+          no_output_timeout: 30m


Is this timeout setting still necessary?

Thanks for pointing this out, I think we could remove it now because this PR seems not to exceed the timeout when I add new test cases

onlyjackfrost · 2023-05-26T03:05:33Z

packages/build/test/schema-parser/middleware/checkCache.spec.ts

-    metadata: {
-      'cache.vulcan.com': {
-        isUsedTag: true,
+it.each([


Thanks for refactoring this using Jest native method!

onlyjackfrost · 2023-05-26T04:28:14Z

packages/extension-driver-duckdb/src/lib/duckdbDataSource.ts

+  ) {
+    Object.entries(indexes).forEach(([index, column]) => {
+      // duckdb could not create duplicate index, so here should drop index if exists and create again.
+      connection.run(`DROP INDEX IF EXISTS ${index} ON ${schema}.${tableName}`);


Could you tell me the purpose of recreating an index with the same name on the same table?

Thanks for reviewing and asking, the reason is that we will refresh and call the importing method again, but the duckdb could create duplicate index names.

Although we could query whether the index from duckdb and check it exists. But it may do more steps:

query the table

get the result and do check if exist

If not exist, create the index

So, for the simple way, we could drop and recreate directly.

PS: I discovered my DROP INDEX IF EXISTS ${index} ON ${schema}.${tableName} syntax is inccoret, the duckdb couldn't drop the index on the specific table. https://duckdb.org/docs/sql/indexes#drop-index

Thanks for @onlyjackfrost feedback the worried cost a lot of time when dropping the index if the table has thousands of thousands ( seems it's an issue in OLTP database, but not sure for OLAP database), we could arrange the performance testing in the future and if it's a issue, we will change back to the above checking steps.

…n schema for cache layer feature - add validation for checking index name should be unique and index name format pattern in "CheckCache" middleware of the build package. - update the test cases for checking index in the "CheckCache" middleware. - support creating index in the duckdb cache layer at duckdb data source. - add test cases to test created index in the duckdb data source. - support checking index name should not duplicate more than on schema files in the "CacheLayerRefresher". - add test cases to "CacheLayerRefresher". - add .eslintignore to prevent liniting the sub node_modules in the monorepo package, here indicating the extension-driver-snowflake.

onlyjackfrost

Thanks for answering, LGTM

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 81cf156 to 364421c Compare May 24, 2023 12:58

vercel bot deployed to Preview May 24, 2023 12:58 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 364421c to 27c75b6 Compare May 24, 2023 13:00

vercel bot deployed to Preview May 24, 2023 13:01 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 27c75b6 to beae905 Compare May 24, 2023 13:25

vercel bot deployed to Preview May 24, 2023 13:25 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from beae905 to 9f5a571 Compare May 24, 2023 13:37

vercel bot deployed to Preview May 24, 2023 13:37 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 9f5a571 to dfb35b3 Compare May 24, 2023 14:09

vercel bot deployed to Preview May 24, 2023 14:09 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from dfb35b3 to 092cb26 Compare May 24, 2023 14:10

vercel bot deployed to Preview May 24, 2023 14:10 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 092cb26 to c553809 Compare May 24, 2023 14:12

vercel bot deployed to Preview May 24, 2023 14:12 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from c553809 to 636560c Compare May 24, 2023 14:31

vercel bot deployed to Preview May 24, 2023 14:31 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 636560c to ef8a8e4 Compare May 24, 2023 14:54

vercel bot deployed to Preview May 24, 2023 14:55 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from ef8a8e4 to 69eeb78 Compare May 25, 2023 01:39

vercel bot deployed to Preview May 25, 2023 01:39 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 69eeb78 to c37233c Compare May 25, 2023 02:51

vercel bot deployed to Preview May 25, 2023 02:51 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from c37233c to 38545f8 Compare May 25, 2023 03:14

vercel bot deployed to Preview May 25, 2023 03:15 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 38545f8 to 4c2b8a9 Compare May 25, 2023 03:18

vercel bot deployed to Preview May 25, 2023 03:18 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 4c2b8a9 to fbd6374 Compare May 25, 2023 03:20

vercel bot deployed to Preview May 25, 2023 03:21 View deployment

vercel bot deployed to Preview May 25, 2023 03:29 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 6616687 to 5e88c7c Compare May 25, 2023 03:37

vercel bot deployed to Preview May 25, 2023 03:37 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 5e88c7c to 1f311c6 Compare May 25, 2023 03:57

vercel bot deployed to Preview May 25, 2023 03:58 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 1f311c6 to df53d65 Compare May 25, 2023 05:45

vercel bot deployed to Preview May 25, 2023 05:45 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from df53d65 to 110e529 Compare May 25, 2023 06:08

vercel bot deployed to Preview May 25, 2023 06:08 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 110e529 to 88ea1a7 Compare May 25, 2023 06:42

vercel bot deployed to Preview May 25, 2023 06:42 View deployment

kokokuo requested review from onlyjackfrost, wwwy3y3, andreashimin, JSYOU and fredalai and removed request for onlyjackfrost May 25, 2023 07:33

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 88ea1a7 to 39ddd79 Compare May 25, 2023 07:42

vercel bot deployed to Preview May 25, 2023 07:42 View deployment

onlyjackfrost reviewed May 26, 2023

View reviewed changes

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from 39ddd79 to bd18efa Compare May 26, 2023 08:10

vercel bot deployed to Preview May 26, 2023 08:10 View deployment

kokokuo force-pushed the feature/indexing-refreshing-cache-layer branch from bd18efa to 1d1d578 Compare May 26, 2023 08:12

vercel bot deployed to Preview May 26, 2023 08:12 View deployment

onlyjackfrost approved these changes May 29, 2023

View reviewed changes

kokokuo merged commit 2b5e282 into develop May 29, 2023

hanshino deleted the feature/indexing-refreshing-cache-layer branch January 31, 2024 07:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Support indexing column and refreshing data by time the in the cache layer #173

Feature: Support indexing column and refreshing data by time the in the cache layer #173

kokokuo commented May 24, 2023 •

edited

Loading

vercel bot commented May 24, 2023 •

edited

Loading

codecov-commenter commented May 25, 2023 •

edited

Loading

onlyjackfrost left a comment

onlyjackfrost May 26, 2023

kokokuo May 26, 2023

onlyjackfrost May 26, 2023

onlyjackfrost May 26, 2023

kokokuo May 26, 2023 •

edited

Loading

kokokuo May 29, 2023

onlyjackfrost left a comment

Feature: Support indexing column and refreshing data by time the in the cache layer #173

Feature: Support indexing column and refreshing data by time the in the cache layer #173

Conversation

kokokuo commented May 24, 2023 • edited Loading

Description

What’s the problem you're trying to solve

Issue ticket number

Additional Context

vercel bot commented May 24, 2023 • edited Loading

codecov-commenter commented May 25, 2023 • edited Loading

Codecov Report

onlyjackfrost left a comment

Choose a reason for hiding this comment

onlyjackfrost May 26, 2023

Choose a reason for hiding this comment

kokokuo May 26, 2023

Choose a reason for hiding this comment

onlyjackfrost May 26, 2023

Choose a reason for hiding this comment

onlyjackfrost May 26, 2023

Choose a reason for hiding this comment

kokokuo May 26, 2023 • edited Loading

Choose a reason for hiding this comment

kokokuo May 29, 2023

Choose a reason for hiding this comment

onlyjackfrost left a comment

Choose a reason for hiding this comment

kokokuo commented May 24, 2023 •

edited

Loading

vercel bot commented May 24, 2023 •

edited

Loading

codecov-commenter commented May 25, 2023 •

edited

Loading

kokokuo May 26, 2023 •

edited

Loading