-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve search and simplify the ranking #341
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @zereraz, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request focuses on improving search result accuracy and display within the application. It addresses user schema scoring, ensures keywords are displayed in event descriptions, and includes minor frontend fixes. Specifically, it modifies the Vespa configuration to improve user search ranking, updates the search result component to display event description keywords, and adds a key prop to a component in the group filter.
Highlights
- User Schema Scoring: The Vespa user schema is updated to improve search ranking by adjusting the rank profile to use nativeRank and bm25 functions.
- Event Description Keywords: The search result component now displays keywords in event descriptions by utilizing chunks_summary. The description is chunked, and highlighted text is displayed.
- Frontend Fixes: A key prop is added to the GroupFilterItem component to resolve a React key issue.
Changelog
Click here to see the changelog
- frontend/src/components/GroupFilter.tsx
- Added a
key
prop to theGroupFilterItem
component to uniquely identify each item in the list.
- Added a
- frontend/src/components/SearchResult.tsx
- Updated the component to accept
result
andindex
props using a more explicit type definition. - Modified the component to map over all
chunks_summary
items instead of slicing and only displaying the first one for file, user, and event result types. - Modified the event result type to display keywords from
chunks_summary
in the description.
- Updated the component to accept
- server/ai/provider/fireworksClient.ts
- Removed example code.
- server/ai/provider/index.ts
- Removed console log statement.
- server/search/mappers.ts
- Added a function
countHiTags
to count the number of highlight tags in a string. - Modified the Vespa search response mapping to sort and slice
chunks_summary
based on the number of highlight tags for file, user, mail, and event result types. - For event results, the description is now chunked and used to generate
chunks_summary
. - Modified the Vespa autocomplete response mapping to add a type field to the result.
- Added a function
- server/search/vespa.ts
- Removed unused import
removeStopwords
. - Modified
autocomplete
andsearchVespa
to use the original query instead of removing stopwords.
- Removed unused import
- server/shared/types.ts
- Added
chunks_summary
property to theEventResponseSchema
.
- Added
- server/utils.ts
- Removed unused import
englishStopwords
. - Removed
removeStopwords
function.
- Removed unused import
- server/vespa/schemas/event.sd
- Updated
THREE_MONTHS_IN_SECONDS
constant. - Added a document-summary for the default case to enable bolding on the description.
- Updated
- server/vespa/schemas/user.sd
- Removed
enable-bm25
from thename
andemail
field indexing. - Modified the
default
rank profile to usenativeRank
andbm25
functions for scoring.
- Removed
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Did you know?
Vespa is named after a genus of wasps, known for their precision and efficiency in building nests, reflecting the platform's design for efficient data retrieval.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request addresses user schema scoring and displays keywords in the event description, enhancing search result relevance and user experience. The changes are well-structured and address the intended functionality. However, there are a few areas where improvements can be made to enhance code clarity and efficiency.
Summary of Findings
- Type Safety and Code Clarity: The code uses type assertions with
any
which reduces type safety. Using more specific types improves code readability and maintainability. - Efficiency in Data Processing: The code includes sorting and slicing operations on
chunks_summary
which could be optimized for performance, especially when dealing with large datasets. - Redundant Code: There are instances of redundant code, such as repeated sorting and slicing operations, that can be consolidated to improve efficiency and maintainability.
Merge Readiness
The pull request is almost ready for merging. Addressing the comments regarding type safety, code clarity, and efficiency will significantly improve the quality of the code. I am unable to directly approve this pull request, and recommend that others review and approve this code before merging. I recommend that the pull request not be merged until the comments are addressed.
if ((child.fields as VespaFileSearch).sddocname === fileSchema) { | ||
;(child.fields as any).type = fileSchema | ||
;(child.fields as any).relevance = child.relevance | ||
;(child.fields as any).chunks_summary = ( | ||
child.fields as VespaFileSearch | ||
).chunks_summary | ||
return FileResponseSchema.parse(child.fields) | ||
const fields = child.fields as VespaFileSearch & { | ||
type?: string | ||
chunks_summary?: string[] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// creating a new property | ||
fields.chunks_summary = fields.description | ||
? chunkDocument(fields.description) | ||
.map((v) => v.chunk) | ||
.sort((a, b) => countHiTags(b) - countHiTags(a)) | ||
.slice(0, maxSearchChunks) | ||
: [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chunks_summary
is sorted and sliced multiple times across different schema types. Consider creating a utility function to handle this logic to avoid repetition and improve maintainability.
function processChunks(chunks: string[] | undefined): string[] | undefined {
if (!chunks) return undefined
const sorted = [...chunks].sort((a, b) => countHiTags(b) - countHiTags(a))
return sorted.slice(0, maxSearchChunks)
}
fields.chunks_summary = processChunks(fields.chunks_summary)
fields.chunks_summary = fields.description | ||
? chunkDocument(fields.description) | ||
.map((v) => v.chunk) | ||
.sort((a, b) => countHiTags(b) - countHiTags(a)) | ||
.slice(0, maxSearchChunks) | ||
: [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new chunks_summary
property here might be unexpected. It would be clearer to modify the existing property, or rename it to avoid confusion.
fields.processed_chunks_summary = fields.description
? chunkDocument(fields.description)
.map((v) => v.chunk)
.sort((a, b) => countHiTags(b) - countHiTags(a))
.slice(0, maxSearchChunks)
: []
server/vespa/schemas/user.sd
Outdated
# bm25 gives a very high score as this has less text to match | ||
# compared to other schemas | ||
expression { | ||
(DECAY * scale(combined_bm25)) | ||
scale(combined_nativeRank + (bm25(email)/2 + bm25(name))/2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions that bm25 gives a very high score as this has less text to match compared to other schemas. Consider adding more details to the comment to explain why nativeRank
is being used and how it addresses this issue.
expression {
// bm25 gives a very high score as this has less text to match
// compared to other schemas. nativeRank is used to normalize the scores
// and prevent disproportionately high scores for shorter content.
scale(combined_nativeRank + (bm25(email)/2 + bm25(name))/2)
}
…s where necessary
Description
Testing
Additional Notes