What's the problem you're trying to solve
In #150, we want to enhance query performance after users send an API request to run a data endpoint and fetch the result from the data source. To do that, we need to provide a caching (pre-loading) layer backed by DuckDB.
Describe the solution you’d like
After users define the cache config in the schema YAML, we need a way to take the query result from a data source, keep it in the cache layer, and hand it to DuckDB by creating a table.
Export to parquet format
The method is to keep the query result in Parquet files. Parquet makes the result portable: we can store it locally or remotely and share it with other users.
The directory layout is [templateName]/[profileName]/[cachedTableName], and each data source creates its own files inside that directory (possibly many partition files), named [folderPartitionName].parquet.
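For example, with a template named orders, a profile named pg, and a cached table named top_customers (all illustrative names, including the partition file names), the exported files could look like:

orders/pg/top_customers/part-0.parquet
orders/pg/top_customers/part-1.parquet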
When to export results to Parquet and load them into DuckDB
We decided to export the query result from the data source to Parquet, then read the Parquet files and load them into DuckDB when users run vulcan serve to start the Vulcan server. The reason is to avoid doing this work when an API request arrives. Two cases apply (a minimal sketch follows the list):
If no Parquet files are found in the [templateName]/[profileName]/[cachedTableName] directory:
Read the data source's data and export it to Parquet.
Then read the Parquet files and load the data into DuckDB.
If Parquet files exist in the [templateName]/[profileName]/[cachedTableName] directory, check whether refreshExpression or addRefreshTime is set and has triggered, and if so run the refresh steps:
Read the data source's data and export it to Parquet.
Then read the Parquet files and load the data into DuckDB.
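To make the serve-time decision concrete, here is a minimal sketch; the helper functions are illustrative assumptions, not the actual Vulcan implementation:

import * as fs from 'fs';
import * as path from 'path';

// All helpers below are illustrative assumptions, not real Vulcan APIs.
declare function refreshTriggered(directory: string): boolean; // checks refreshExpression / addRefreshTime
declare function exportToParquet(directory: string): Promise<void>;
declare function loadParquetToDuckDB(directory: string, tableName: string): Promise<void>;

// Run once per cached table when `vulcan serve` starts.
async function ensureCacheLoaded(
  templateName: string,
  profileName: string,
  cachedTableName: string
): Promise<void> {
  const directory = path.join(templateName, profileName, cachedTableName);
  const hasParquet =
    fs.existsSync(directory) &&
    fs.readdirSync(directory).some((file) => file.endsWith('.parquet'));

  if (!hasParquet || refreshTriggered(directory)) {
    // read the data source's data and export it to parquet...
    await exportToParquet(directory);
    // ...then read the parquet files and load the data into duckDB.
    await loadParquetToDuckDB(directory, cachedTableName);
  }
}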
Define CacheLayerLoader to control the flow
Here we define a CacheLayerLoader class that delegates to each data source to export query results to Parquet files according to the artifact, and then loads those Parquet files into DuckDB.
All DataSource instances are injected into the CacheLayerLoader when the container loads.
Define a method named preload:
preload calls export on each DataSource and records, in a key mapper, each cachedTableName together with its [templateName]/[profileName]/[cachedTableName] directory. It then needs to read all Parquet files under each [templateName]/[profileName]/[cachedTableName] directory and load them into DuckDB.
After all Parquet files are exported, preload of CacheLayerLoader loads all the Parquet data in each [templateName]/[profileName]/[cachedTableName] directory and creates the cachedTableName table in DuckDB.
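To make this flow concrete, here is a minimal sketch of preload. Only the CacheLayerLoader / preload names and the export / import contract (defined below) come from this design; the artifact shape, the key mapper field, and the duckDB data source wiring are assumptions:

// The CacheArtifact shape and all wiring below are illustrative assumptions.
interface CacheArtifact {
  templateName: string;
  profileName: string;
  cachedTableName: string;
  sql: string;
}

export class CacheLayerLoader {
  // key mapper: cachedTableName -> [templateName]/[profileName]/[cachedTableName]
  private cacheStorageMapper = new Map<string, string>();

  constructor(
    // all DataSource instances, injected when the container loads
    private dataSources: Map<string, DataSource>,
    // the duckDB data source that backs the cache layer
    private duckdb: DataSource
  ) {}

  public async preload(artifacts: CacheArtifact[]): Promise<void> {
    // 1. delegate each data source to export its query result to parquet files
    for (const { templateName, profileName, cachedTableName, sql } of artifacts) {
      const directory = `${templateName}/${profileName}/${cachedTableName}`;
      const dataSource = this.dataSources.get(profileName)!;
      await dataSource.export({ sql, directory, profileName, type: 'parquet' });
      this.cacheStorageMapper.set(cachedTableName, directory);
    }
    // 2. after all parquet files are exported, load each directory into duckDB
    for (const [tableName, directory] of this.cacheStorageMapper) {
      await this.duckdb.import({
        tableName,
        directory,
        profileName: 'duckdb-cache', // assumption
        schema: 'vulcan',            // assumption
        type: 'parquet',
      });
    }
  }
}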
Additional Context
Based on the above solution, here are some details that need to be implemented:
How do we know where to save and read the Parquet files?
We could set a parameter in vulcan.yaml, similar to template, that defines which loader serves as our cache layer storage and which format the SQL result is exported to before being loaded into the cache layer.
So we need to add a cache key to ICoreOptions with an ICacheLayerOptions interface, and a CacheLayerOptions class to validate all fields of the cache key in vulcan.yaml.
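As a rough sketch of what those types could look like (the folderPath and storeFormat field names are assumptions, not the final option names):

// A minimal sketch; the folderPath and storeFormat field names are assumptions.
export enum CacheLayerStoreFormatType {
  parquet = 'parquet',
}

export interface ICacheLayerOptions {
  // where exported cache files are kept
  folderPath?: string;
  // the format query results are exported to before loading into the cache layer
  storeFormat?: CacheLayerStoreFormatType | string;
}

export class CacheLayerOptions implements ICacheLayerOptions {
  public readonly folderPath?: string;
  public readonly storeFormat?: string;

  constructor(options: ICacheLayerOptions = {}) {
    Object.assign(this, options);
  }

  // validate all fields under the "cache" key of vulcan.yaml
  public validate(): void {
    const supported = Object.values(CacheLayerStoreFormatType) as string[];
    if (this.storeFormat && !supported.includes(this.storeFormat)) {
      throw new Error(`Unsupported cache layer store format: ${this.storeFormat}`);
    }
  }
}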
Define a method contract for exporting results to parquet in DataSource
Define an export method contract so that each data source can implement it in separate issues:
export interface ExportOptions {
  // The sql query result to export
  sql: string;
  // The directory to export the result file to
  directory: string;
  // The profile name to select to export data
  profileName: string;
  // export file format type
  type: CacheLayerStoreFormatType | string;
}

export interface ImportOptions {
  // The table name to create from imported file data
  tableName: string;
  // The directory to import cache files from
  directory: string;
  // The profile name to select to import data
  profileName: string;
  // default schema
  schema: string;
  // import file format type
  type: CacheLayerStoreFormatType | string;
}
@VulcanExtension(TYPES.Extension_DataSource, { enforcedId: true })
export abstract class DataSource<
  C = any,
  PROFILE = Record<string, any>
> extends ExtensionBase<C> {
  private profiles: Map<string, Profile<PROFILE>>;

  constructor(
    @inject(TYPES.ExtensionConfig) config: C,
    @inject(TYPES.ExtensionName) moduleName: string,
    @multiInject(TYPES.Profile) @optional() profiles: Profile[] = []
  ) {
    super(config, moduleName);
    this.profiles = profiles.reduce(
      (prev, curr) => prev.set(curr.name, curr),
      new Map()
    );
  }

  // ...

  /**
   * Export query result data to cache file for cache layer loader used
   */
  public export(options: ExportOptions): Promise<void> {
    throw new Error(`Export method not implemented`);
  }

  /**
   * Import data to create table from cache file for cache layer loader used
   */
  public import(options: ImportOptions): Promise<void> {
    throw new Error(`import method not implemented`);
  }
}
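As a usage sketch, a concrete data source would override export to run the query against the selected profile and write the result under options.directory. The class and helpers below are hypothetical, not real Vulcan APIs:

// SampleDataSource and both helpers are illustrative only.
declare function runQuery(sql: string, profileName: string): Promise<Record<string, any>[]>;
declare function writeParquetFiles(rows: Record<string, any>[], directory: string): Promise<void>;

class SampleDataSource extends DataSource {
  public override async export(options: ExportOptions): Promise<void> {
    const { sql, directory, profileName } = options;
    // run the query against the selected profile...
    const rows = await runQuery(sql, profileName);
    // ...and write the rows as partition files, e.g. [folderPartitionName].parquet
    await writeParquetFiles(rows, directory);
  }
}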
How to load Parquet into DuckDB
Here is a simple example showing how to load a Parquet file into DuckDB:
import * as duckdb from 'duckdb';

const db = new duckdb.Database(':memory:');

// create a table from a parquet file in memory.
db.run(
  'CREATE TABLE people AS SELECT * FROM read_parquet(?)',
  'candidate.parquet',
);

// select all rows from the table
const statement = db.prepare('SELECT * FROM people');
const queryResult = await statement.stream();
const firstChunk = await queryResult.nextChunk();

// get the columns and data stream. "getColumns" and "getData" defined in vulcan
const columns = getColumns(firstChunk);
const dataStream = getData(queryResult, firstChunk);