Implement full text search #4416

acquamarin · 2024-10-24T20:55:03Z

This PR implements full-text search (FTS) in KUZU, as described in #4269.
To create an FTS index:
CALL CREATE_FTS_INDEX(TABLE_NAME, INDEX_NAME, [PROPERTY1, PROPERTY2, ...])
To query an FTS index:
CALL QUERY_FTS_INDEX(TABLE_NAME, INDEX_NAME, QUERY_STRING)
The QUERY_FTS_INDEX function returns documents together with their BM25 match scores against the query, in unsorted order. Documents without a score (NULL score) are not output for now.
To drop an FTS index:
CALL DROP_FTS_INDEX(TABLE_NAME, INDEX_NAME)
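
For readers unfamiliar with BM25, the per-term score can be sketched as below. This is a minimal illustration of one common formulation and not the code in this PR: the names BM25Params, idf, and bm25TermScore are hypothetical, and the k1 and b defaults are conventional textbook values that this PR does not specify.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical parameter bundle. k1 and b are the usual BM25 tuning constants;
// the defaults are common textbook values, not values taken from this PR.
struct BM25Params {
    double k1 = 1.2;
    double b = 0.75;
};

// Inverse document frequency of a term.
// numDocs: total number of documents; docFreq: documents containing the term.
double idf(uint64_t numDocs, uint64_t docFreq) {
    assert(docFreq <= numDocs);
    return std::log((numDocs - docFreq + 0.5) / (docFreq + 0.5) + 1.0);
}

// Contribution of a single query term to a single document's score.
// termFreq: occurrences of the term in the document; docLen: document length.
double bm25TermScore(uint64_t termFreq, uint64_t docLen, double avgDocLen, uint64_t numDocs,
    uint64_t docFreq, const BM25Params& params = BM25Params()) {
    assert(avgDocLen > 0.0);
    const double tf = static_cast<double>(termFreq);
    // Length normalization: longer-than-average documents are penalized.
    const double norm = params.k1 * (1.0 - params.b + params.b * docLen / avgDocLen);
    return idf(numDocs, docFreq) * (tf * (params.k1 + 1.0)) / (tf + norm);
}
```

Under this formulation, a document's score for a query is the sum of bm25TermScore over the query terms that occur in the document. A document containing none of the terms gets no contribution, which corresponds to the NULL-score case described above.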

The create_fts_index and drop_fts_index functions are treated as standalone functions and are rewritten during the parser stage.

The query_fts_index function is currently rewritten as a call to an internal GDS function, fts.
TODOs:

Feature:

Drop table should drop all indexes built on that table.
Support copy in manual transactions.
Support optional parameters in standalone call functions (e.g. stopwords, stem).
Support virtual tables.
Disallow users from querying virtual tables/properties, helper functions, and the internal fts function.
Support multi-statement queries in the testing framework.
Support querying a subset of columns in query_fts_index().
Output nodes with a null score.
Disallow users from dropping a property that has an index on it.
Stop using edgeCompute to compute the scores; compute them in a single-threaded manner instead.
Once edgeCompute is removed, instead of passing the semi-masker to identify the terms, pass in the terms in a sparse map, e.g., ValueVector. Both constructing the semi-mask and looping over it are redundant, assuming queries contain only a few terms, so we can just loop over a sparse map instead.
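
The cost difference behind the last item can be sketched as follows. This is only an illustration under assumptions: DenseMask and SparseTerms are hypothetical stand-ins built from standard containers, not the semi-masker or ValueVector types used in the codebase, and the visit counters exist purely to make the amount of work visible.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dense semi-mask: one flag per term offset.
using DenseMask = std::vector<bool>;
// Hypothetical sparse alternative: only the query terms, keyed by term offset.
// The mapped value could carry per-term data such as document frequency.
using SparseTerms = std::unordered_map<uint64_t, uint64_t>;

// Dense approach: every slot is inspected, so the work grows with the
// number of terms in the index, not with the number of query terms.
uint64_t collectFromDense(const DenseMask& mask, std::vector<uint64_t>& matched) {
    uint64_t visited = 0;
    for (uint64_t offset = 0; offset < mask.size(); ++offset) {
        ++visited;
        if (mask[offset]) {
            matched.push_back(offset);
        }
    }
    return visited;
}

// Sparse approach: only the query terms are visited.
uint64_t collectFromSparse(const SparseTerms& terms, std::vector<uint64_t>& matched) {
    uint64_t visited = 0;
    for (const auto& entry : terms) {
        ++visited;
        matched.push_back(entry.first);
    }
    return visited;
}
```

With an index of 1000 terms and a two-term query, collectFromDense inspects all 1000 slots while collectFromSparse inspects two entries; both return the same two matches.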

Optimization:

Instead of having a separate doc table, we can store the document length (len) as a private field in the node table.
One-way rel table.
QUERY_FTS front-end optimizer.

codecov · 2024-10-25T00:09:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.48%. Comparing base (aef137b) to head (5228a31).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4416      +/-   ##
==========================================
- Coverage   86.57%   86.48%   -0.10%     
==========================================
  Files        1363     1370       +7     
  Lines       57650    57965     +315     
  Branches     7160     7189      +29     
==========================================
+ Hits        49911    50129     +218     
- Misses       7565     7662      +97     
  Partials      174      174


extension/fts/CMakeLists.txt

extension/fts/src/function/create_fts_index.cpp

semihsalihoglu-uw · 2024-10-26T16:03:39Z

extension/fts/src/function/create_fts_index.cpp

+
+static std::unique_ptr<TableFuncBindData> bindFunc(ClientContext* context,
+    ScanTableFuncBindInput* input) {
+    std::vector<std::string> columnNames;


I think we sometimes use the term column (e.g., here) and sometimes property (e.g., in the function above). It's probably better to use the term property throughout for consistency, so I'd rename these to propertyName and propertyType.

In tableFunctions, we always use the term columns rather than property.

dataset/fts-small/vDoc.csv

src/processor/map/map_property_collector.cpp

semihsalihoglu-uw · 2024-10-26T19:01:24Z

src/parser/visitor/standalone_call_analyzer.cpp

+namespace kuzu {
+namespace parser {
+
+std::string StandaloneCallAnalyzer::getRewriteQuery(const Statement& statement) {


Ditto. @andyfengHKU should review this but I want to understand too.

semihsalihoglu-uw · 2024-10-26T19:03:11Z

src/planner/plan/plan_read.cpp

@@ -140,6 +142,18 @@ void Planner::planGDSCall(const BoundReadingClause& readingClause,
            auto gdsCall = getGDSCall(call.getInfo());
            gdsCall->computeFactorizedSchema();
            probePlan.setLastOperator(gdsCall);
+            if (gdsCall->constPtrCast<LogicalGDSCall>()->getInfo().func.name == "FTS") {


This is for @andyfengHKU to review.

semihsalihoglu-uw · 2024-10-26T19:05:50Z

src/processor/operator/ddl/drop.cpp

@@ -15,6 +15,7 @@ bool isValidEntry(parser::DropInfo& dropInfo, main::ClientContext* context) {
    case common::DropType::SEQUENCE: {
        validEntry = context->getCatalog()->containsSequence(context->getTx(), dropInfo.name);
    } break;
+        // TODO(Ziyi): If the table has indexes, we should drop those indexes as well.


Maybe remove this TODO from here and instead put it in an issue that keeps track of the TODOs.

semihsalihoglu-uw · 2024-10-26T19:07:24Z

src/storage/store/csr_node_group.cpp

@@ -22,15 +22,15 @@ bool CSRNodeGroupScanState::tryScanCachedTuples(RelTableScanState& tableScanStat
    const auto startCSROffset = header->getStartCSROffset(boundNodeOffsetInGroup);
    const auto csrLength = header->getCSRLength(boundNodeOffsetInGroup);
    nextCachedRowToScan = std::max(nextCachedRowToScan, startCSROffset);
-    if (nextCachedRowToScan >= numScannedRows ||
-        nextCachedRowToScan < numScannedRows - numCachedRows) {
+    if (nextCachedRowToScan >= nextRowToScan ||


Why are you doing this renaming in this PR? Can you revert? Or get @ray6080 to review this.

This code is from @benjaminwinger

semihsalihoglu-uw · 2024-10-26T19:08:16Z

src/storage/store/node_group.cpp

@@ -128,7 +129,7 @@ void NodeGroup::initializeScanState(Transaction*, const UniqLock& lock,
    TableScanState& state) const {
    auto& nodeGroupScanState = *state.nodeGroupScanState;
    nodeGroupScanState.chunkedGroupIdx = 0;
-    nodeGroupScanState.numScannedRows = 0;
+    nodeGroupScanState.nextRowToScan = 0;


Ditto about reverting this or getting @ray6080 to review this.

Ditto, from @benjaminwinger

acquamarin and others added 6 commits October 24, 2024 15:43

Implement fts

eb58dc1

update

b3d8c1b

update

6ac1bd2

finialize

f7a075d

update

5291903

Run clang-format

8893911

benjaminwinger mentioned this pull request Oct 24, 2024

Fix bugs in GDS vertex property scanning #4417

Merged

Fix bugs in GDS vertex property scanning (#4417)

3758003

acquamarin added 7 commits October 24, 2024 21:26

Fix format

65419a7

Add more tests

3a1d0cb

update

b281e02

Fix ci

2a15b82

u

34c3164

update

72c7a63

fix

bba98f4

acquamarin marked this pull request as ready for review October 26, 2024 02:19

acquamarin requested a review from benjaminwinger as a code owner October 26, 2024 02:19

acquamarin requested review from semihsalihoglu-uw and andyfengHKU October 26, 2024 02:19

semihsalihoglu-uw requested changes Oct 26, 2024

acquamarin added 2 commits October 26, 2024 17:32

Fix comments

b5ae2a0

Fix fts execution

5228a31
