improve search and simplify the ranking #341

zereraz · 2025-02-28T00:48:14Z

Description

Testing

Additional Notes

gemini-code-assist

Hello @zereraz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request focuses on improving search result accuracy and display within the application. It addresses user schema scoring, ensures keywords are displayed in event descriptions, and includes minor frontend fixes. Specifically, it modifies the Vespa configuration to improve user search ranking, updates the search result component to display event description keywords, and adds a key prop to a component in the group filter.

Highlights

User Schema Scoring: The Vespa user schema is updated to improve search ranking by adjusting the rank profile to use nativeRank and bm25 functions.
Event Description Keywords: The search result component now displays keywords in event descriptions by utilizing chunks_summary. The description is chunked, and highlighted text is displayed.
Frontend Fixes: A key prop is added to the GroupFilterItem component to resolve a React key issue.

Changelog

Click here to see the changelog

frontend/src/components/GroupFilter.tsx
- Added a key prop to the GroupFilterItem component to uniquely identify each item in the list.
frontend/src/components/SearchResult.tsx
- Updated the component to accept result and index props using a more explicit type definition.
- Modified the component to map over all chunks_summary items instead of slicing and only displaying the first one for file, user, and event result types.
- Modified the event result type to display keywords from chunks_summary in the description.
server/ai/provider/fireworksClient.ts
- Removed example code.
server/ai/provider/index.ts
- Removed console log statement.
server/search/mappers.ts
- Added a function countHiTags to count the number of highlight tags in a string.
- Modified the Vespa search response mapping to sort and slice chunks_summary based on the number of highlight tags for file, user, mail, and event result types.
- For event results, the description is now chunked and used to generate chunks_summary.
- Modified the Vespa autocomplete response mapping to add a type field to the result.
server/search/vespa.ts
- Removed unused import removeStopwords.
- Modified autocomplete and searchVespa to use the original query instead of removing stopwords.
server/shared/types.ts
- Added chunks_summary property to the EventResponseSchema.
server/utils.ts
- Removed unused import englishStopwords.
- Removed removeStopwords function.
server/vespa/schemas/event.sd
- Updated THREE_MONTHS_IN_SECONDS constant.
- Added a document-summary for the default case to enable bolding on the description.
server/vespa/schemas/user.sd
- Removed enable-bm25 from the name and email field indexing.
- Modified the default rank profile to use nativeRank and bm25 functions for scoring.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Did you know?

Vespa is named after a genus of wasps, known for their precision and efficiency in building nests, reflecting the platform's design for efficient data retrieval.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

The pull request addresses user schema scoring and displays keywords in the event description, enhancing search result relevance and user experience. The changes are well-structured and address the intended functionality. However, there are a few areas where improvements can be made to enhance code clarity and efficiency.

Summary of Findings

Type Safety and Code Clarity: The code uses type assertions with any which reduces type safety. Using more specific types improves code readability and maintainability.
Efficiency in Data Processing: The code includes sorting and slicing operations on chunks_summary which could be optimized for performance, especially when dealing with large datasets.
Redundant Code: There are instances of redundant code, such as repeated sorting and slicing operations, that can be consolidated to improve efficiency and maintainability.

Merge Readiness

The pull request is almost ready for merging. Addressing the comments regarding type safety, code clarity, and efficiency will significantly improve the quality of the code. I am unable to directly approve this pull request, and recommend that others review and approve this code before merging. I recommend that the pull request not be merged until the comments are addressed.

gemini-code-assist · 2025-02-28T00:48:43Z

server/search/mappers.ts

          if ((child.fields as VespaFileSearch).sddocname === fileSchema) {
-            ;(child.fields as any).type = fileSchema
-            ;(child.fields as any).relevance = child.relevance
-            ;(child.fields as any).chunks_summary = (
-              child.fields as VespaFileSearch
-            ).chunks_summary
-            return FileResponseSchema.parse(child.fields)
+            const fields = child.fields as VespaFileSearch & {
+              type?: string
+              chunks_summary?: string[]
+            }


Consider defining a specific type for fields instead of using & { type?: string; chunks_summary?: string[] }. This would improve type safety and code readability.

type FileFields = VespaFileSearch & { type?: string chunks_summary?: string[] } const fields = child.fields as FileFields

gemini-code-assist · 2025-02-28T00:48:43Z

server/search/mappers.ts

+            // creating a new property
+            fields.chunks_summary = fields.description
+              ? chunkDocument(fields.description)
+                  .map((v) => v.chunk)
+                  .sort((a, b) => countHiTags(b) - countHiTags(a))
+                  .slice(0, maxSearchChunks)
+              : []


The chunks_summary is sorted and sliced multiple times across different schema types. Consider creating a utility function to handle this logic to avoid repetition and improve maintainability.

function processChunks(chunks: string[] | undefined): string[] | undefined { if (!chunks) return undefined const sorted = [...chunks].sort((a, b) => countHiTags(b) - countHiTags(a)) return sorted.slice(0, maxSearchChunks) } fields.chunks_summary = processChunks(fields.chunks_summary)

gemini-code-assist · 2025-02-28T00:48:43Z

server/search/mappers.ts

+            fields.chunks_summary = fields.description
+              ? chunkDocument(fields.description)
+                  .map((v) => v.chunk)
+                  .sort((a, b) => countHiTags(b) - countHiTags(a))
+                  .slice(0, maxSearchChunks)
+              : []


Creating a new chunks_summary property here might be unexpected. It would be clearer to modify the existing property, or rename it to avoid confusion.

fields.processed_chunks_summary = fields.description ? chunkDocument(fields.description) .map((v) => v.chunk) .sort((a, b) => countHiTags(b) - countHiTags(a)) .slice(0, maxSearchChunks) : []

gemini-code-assist · 2025-02-28T00:48:43Z

server/vespa/schemas/user.sd

+            # bm25 gives a very high score as this has less text to match
+            # compared to other schemas
            expression {
-               (DECAY * scale(combined_bm25))
+               scale(combined_nativeRank + (bm25(email)/2 + bm25(name))/2)


The comment mentions that bm25 gives a very high score as this has less text to match compared to other schemas. Consider adding more details to the comment to explain why nativeRank is being used and how it addresses this issue.

expression { // bm25 gives a very high score as this has less text to match // compared to other schemas. nativeRank is used to normalize the scores // and prevent disproportionately high scores for shorter content. scale(combined_nativeRank + (bm25(email)/2 + bm25(name))/2) }

…s where necessary

fix user schema scoring and show keywords in event description

9e8878f

gemini-code-assist bot reviewed Feb 28, 2025

View reviewed changes

gemini-code-assist bot suggested changes Feb 28, 2025

View reviewed changes

zereraz changed the title ~~fix user schema scoring and show keywords in event description~~ improve search and simplify the ranking Feb 28, 2025

zereraz and others added 9 commits February 28, 2025 17:19

add improved highlight sorting, remove stop words and retain new line…

75b01e6

…s where necessary

simplify rank expression

aedab97

fix attachment count while ingestion for service account

75f4860

remove title embeddings

0a89a2f

added nativeRank rank-profile

770b02a

update searchVespa to use different rank-profiles

7e888b0

scale vector score for event

907e9d6

update alpha value to default 0.2

d0c3fc7

remove first-phase for user schema

2424c2d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve search and simplify the ranking #341

improve search and simplify the ranking #341

zereraz commented Feb 28, 2025

gemini-code-assist bot left a comment

gemini-code-assist bot left a comment

gemini-code-assist bot Feb 28, 2025

gemini-code-assist bot Feb 28, 2025

gemini-code-assist bot Feb 28, 2025

gemini-code-assist bot Feb 28, 2025

improve search and simplify the ranking #341

Are you sure you want to change the base?

improve search and simplify the ranking #341

Conversation

zereraz commented Feb 28, 2025

Description

Testing

Additional Notes

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Changelog

Footnotes

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Summary of Findings

Merge Readiness

gemini-code-assist bot Feb 28, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Feb 28, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Feb 28, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Feb 28, 2025

Choose a reason for hiding this comment