-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement full text search #4416
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4416 +/- ##
==========================================
- Coverage 86.57% 86.48% -0.10%
==========================================
Files 1363 1370 +7
Lines 57650 57965 +315
Branches 7160 7189 +29
==========================================
+ Hits 49911 50129 +218
- Misses 7565 7662 +97
Partials 174 174 ☔ View full report in Codecov by Sentry. |
|
||
static std::unique_ptr<TableFuncBindData> bindFunc(ClientContext* context, | ||
ScanTableFuncBindInput* input) { | ||
std::vector<std::string> columnNames; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sometimes we use the term column (e.g., here) and sometimes property (e.g., above function). It's probably better to just use the term property for consistency. So I'd rename to propertyName and propertyType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In tableFunctions, we always use the term columns rather than property.
namespace kuzu { | ||
namespace parser { | ||
|
||
std::string StandaloneCallAnalyzer::getRewriteQuery(const Statement& statement) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. @andyfengHKU should review this but I want to understand too.
@@ -140,6 +142,18 @@ void Planner::planGDSCall(const BoundReadingClause& readingClause, | |||
auto gdsCall = getGDSCall(call.getInfo()); | |||
gdsCall->computeFactorizedSchema(); | |||
probePlan.setLastOperator(gdsCall); | |||
if (gdsCall->constPtrCast<LogicalGDSCall>()->getInfo().func.name == "FTS") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for @andyfengHKU to review.
@@ -15,6 +15,7 @@ bool isValidEntry(parser::DropInfo& dropInfo, main::ClientContext* context) { | |||
case common::DropType::SEQUENCE: { | |||
validEntry = context->getCatalog()->containsSequence(context->getTx(), dropInfo.name); | |||
} break; | |||
// TODO(Ziyi): If the table has indexes, we should drop those indexes as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe remove this TODO here but only put it into an issue that keeps track of the TODOs.
@@ -22,15 +22,15 @@ bool CSRNodeGroupScanState::tryScanCachedTuples(RelTableScanState& tableScanStat | |||
const auto startCSROffset = header->getStartCSROffset(boundNodeOffsetInGroup); | |||
const auto csrLength = header->getCSRLength(boundNodeOffsetInGroup); | |||
nextCachedRowToScan = std::max(nextCachedRowToScan, startCSROffset); | |||
if (nextCachedRowToScan >= numScannedRows || | |||
nextCachedRowToScan < numScannedRows - numCachedRows) { | |||
if (nextCachedRowToScan >= nextRowToScan || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you doing this renaming in this PR? Can you revert? Or get @ray6080 to review this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is from @benjaminwinger
@@ -128,7 +129,7 @@ void NodeGroup::initializeScanState(Transaction*, const UniqLock& lock, | |||
TableScanState& state) const { | |||
auto& nodeGroupScanState = *state.nodeGroupScanState; | |||
nodeGroupScanState.chunkedGroupIdx = 0; | |||
nodeGroupScanState.numScannedRows = 0; | |||
nodeGroupScanState.nextRowToScan = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto about reverting this or getting @ray6080 to review this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, from @benjaminwinger
This PR implements full-text-search in KUZU which is described in #4269.
To create a FTS index:
CALL CREATE_FTS_INDEX(TABLE_NAME, INDEX_NAME, [PROPERTY1, PROPERTY2, ...])
To query a FTS index:
CALL QUERY_FTS_INDEX(TABLE_NAME, INDEX_NAME, QUERY_STRING)
The QUERY_FTS_INDEX function should return documents and their BM25 match score with the query in an unsorted order. Documents without scores (NULL score) is not output right now.
To drop a FTS index:
CALL DROP_FTS_INDEX(TABLE_NAME, INDEX_NAME)
The
create_fts_index
anddrop_fts_index
functions are treated as standalone functions, and will be rewritten during the parser stage.The
query_fts_index
function is currently rewritten as an internal gds function callfts
.TODOS:
Feature:
query_fts_index()
Optimization:
len
as a private filed in the node table.