Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement full text search #4416

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

Implement full text search #4416

wants to merge 16 commits into from

Conversation

acquamarin
Copy link
Collaborator

@acquamarin acquamarin commented Oct 24, 2024

This PR implements full-text-search in KUZU which is described in #4269.
To create a FTS index:
CALL CREATE_FTS_INDEX(TABLE_NAME, INDEX_NAME, [PROPERTY1, PROPERTY2, ...])
To query a FTS index:
CALL QUERY_FTS_INDEX(TABLE_NAME, INDEX_NAME, QUERY_STRING)
The QUERY_FTS_INDEX function should return documents and their BM25 match score with the query in an unsorted order. Documents without scores (NULL score) is not output right now.
To drop a FTS index:
CALL DROP_FTS_INDEX(TABLE_NAME, INDEX_NAME)

The create_fts_index and drop_fts_index functions are treated as standalone functions, and will be rewritten during the parser stage.

The query_fts_index function is currently rewritten as an internal gds function call fts.
TODOS:

Feature:

  1. Drop table should drop all indexes built on that table
  2. Support copy in manual transaction.
  3. Support optional parameters in standalone call functions. (e.g. stopwords, stem)
  4. Support virtual tables.
  5. Disallow users to query virtual tables/properties and helper functions/internal fts function.
  6. Support multi-statements queries in testing framework.
  7. Support querying a subset of columns in query_fts_index()
  8. Output nodes with null score.
  9. Disallow users drop a property which has an index on it.
  10. Remove edgeCompute to compute the scores. Do this in a single threaded manner.
  11. Once edgeCompute is removed, instead of passing the semi-masker to identify the terms, pass in the terms in a sparse map, e.g., ValueVector. Both the construction of the semimask and the looping over it is redundant assuming queries will have a few terms in them. So we can just loop over a sparse map instead.

Optimization:

  1. Instead of having a separate doc table, we can store the len as a private filed in the node table.
  2. One way rel table.
  3. QUERY_FTS front end optimizer.

Copy link

codecov bot commented Oct 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.48%. Comparing base (aef137b) to head (5228a31).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4416      +/-   ##
==========================================
- Coverage   86.57%   86.48%   -0.10%     
==========================================
  Files        1363     1370       +7     
  Lines       57650    57965     +315     
  Branches     7160     7189      +29     
==========================================
+ Hits        49911    50129     +218     
- Misses       7565     7662      +97     
  Partials      174      174              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@acquamarin acquamarin marked this pull request as ready for review October 26, 2024 02:19
extension/fts/CMakeLists.txt Show resolved Hide resolved
extension/fts/src/function/create_fts_index.cpp Outdated Show resolved Hide resolved

static std::unique_ptr<TableFuncBindData> bindFunc(ClientContext* context,
ScanTableFuncBindInput* input) {
std::vector<std::string> columnNames;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sometimes we use the term column (e.g., here) and sometimes property (e.g., above function). It's probably better to just use the term property for consistency. So I'd rename to propertyName and propertyType.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In tableFunctions, we always use the term columns rather than property.

dataset/fts-small/vDoc.csv Show resolved Hide resolved
src/processor/map/map_property_collector.cpp Show resolved Hide resolved
namespace kuzu {
namespace parser {

std::string StandaloneCallAnalyzer::getRewriteQuery(const Statement& statement) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. @andyfengHKU should review this but I want to understand too.

@@ -140,6 +142,18 @@ void Planner::planGDSCall(const BoundReadingClause& readingClause,
auto gdsCall = getGDSCall(call.getInfo());
gdsCall->computeFactorizedSchema();
probePlan.setLastOperator(gdsCall);
if (gdsCall->constPtrCast<LogicalGDSCall>()->getInfo().func.name == "FTS") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for @andyfengHKU to review.

@@ -15,6 +15,7 @@ bool isValidEntry(parser::DropInfo& dropInfo, main::ClientContext* context) {
case common::DropType::SEQUENCE: {
validEntry = context->getCatalog()->containsSequence(context->getTx(), dropInfo.name);
} break;
// TODO(Ziyi): If the table has indexes, we should drop those indexes as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove this TODO here but only put it into an issue that keeps track of the TODOs.

@@ -22,15 +22,15 @@ bool CSRNodeGroupScanState::tryScanCachedTuples(RelTableScanState& tableScanStat
const auto startCSROffset = header->getStartCSROffset(boundNodeOffsetInGroup);
const auto csrLength = header->getCSRLength(boundNodeOffsetInGroup);
nextCachedRowToScan = std::max(nextCachedRowToScan, startCSROffset);
if (nextCachedRowToScan >= numScannedRows ||
nextCachedRowToScan < numScannedRows - numCachedRows) {
if (nextCachedRowToScan >= nextRowToScan ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you doing this renaming in this PR? Can you revert? Or get @ray6080 to review this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is from @benjaminwinger

@@ -128,7 +129,7 @@ void NodeGroup::initializeScanState(Transaction*, const UniqLock& lock,
TableScanState& state) const {
auto& nodeGroupScanState = *state.nodeGroupScanState;
nodeGroupScanState.chunkedGroupIdx = 0;
nodeGroupScanState.numScannedRows = 0;
nodeGroupScanState.nextRowToScan = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about reverting this or getting @ray6080 to review this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, from @benjaminwinger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants