Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve search and simplify the ranking #341

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

improve search and simplify the ranking #341

wants to merge 10 commits into from

Conversation

zereraz
Copy link
Contributor

@zereraz zereraz commented Feb 28, 2025

Description

Testing

Additional Notes

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @zereraz, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request focuses on improving search result accuracy and display within the application. It addresses user schema scoring, ensures keywords are displayed in event descriptions, and includes minor frontend fixes. Specifically, it modifies the Vespa configuration to improve user search ranking, updates the search result component to display event description keywords, and adds a key prop to a component in the group filter.

Highlights

  • User Schema Scoring: The Vespa user schema is updated to improve search ranking by adjusting the rank profile to use nativeRank and bm25 functions.
  • Event Description Keywords: The search result component now displays keywords in event descriptions by utilizing chunks_summary. The description is chunked, and highlighted text is displayed.
  • Frontend Fixes: A key prop is added to the GroupFilterItem component to resolve a React key issue.

Changelog

Click here to see the changelog
  • frontend/src/components/GroupFilter.tsx
    • Added a key prop to the GroupFilterItem component to uniquely identify each item in the list.
  • frontend/src/components/SearchResult.tsx
    • Updated the component to accept result and index props using a more explicit type definition.
    • Modified the component to map over all chunks_summary items instead of slicing and only displaying the first one for file, user, and event result types.
    • Modified the event result type to display keywords from chunks_summary in the description.
  • server/ai/provider/fireworksClient.ts
    • Removed example code.
  • server/ai/provider/index.ts
    • Removed console log statement.
  • server/search/mappers.ts
    • Added a function countHiTags to count the number of highlight tags in a string.
    • Modified the Vespa search response mapping to sort and slice chunks_summary based on the number of highlight tags for file, user, mail, and event result types.
    • For event results, the description is now chunked and used to generate chunks_summary.
    • Modified the Vespa autocomplete response mapping to add a type field to the result.
  • server/search/vespa.ts
    • Removed unused import removeStopwords.
    • Modified autocomplete and searchVespa to use the original query instead of removing stopwords.
  • server/shared/types.ts
    • Added chunks_summary property to the EventResponseSchema.
  • server/utils.ts
    • Removed unused import englishStopwords.
    • Removed removeStopwords function.
  • server/vespa/schemas/event.sd
    • Updated THREE_MONTHS_IN_SECONDS constant.
    • Added a document-summary for the default case to enable bolding on the description.
  • server/vespa/schemas/user.sd
    • Removed enable-bm25 from the name and email field indexing.
    • Modified the default rank profile to use nativeRank and bm25 functions for scoring.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Did you know?

Vespa is named after a genus of wasps, known for their precision and efficiency in building nests, reflecting the platform's design for efficient data retrieval.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request addresses user schema scoring and displays keywords in the event description, enhancing search result relevance and user experience. The changes are well-structured and address the intended functionality. However, there are a few areas where improvements can be made to enhance code clarity and efficiency.

Summary of Findings

  • Type Safety and Code Clarity: The code uses type assertions with any which reduces type safety. Using more specific types improves code readability and maintainability.
  • Efficiency in Data Processing: The code includes sorting and slicing operations on chunks_summary which could be optimized for performance, especially when dealing with large datasets.
  • Redundant Code: There are instances of redundant code, such as repeated sorting and slicing operations, that can be consolidated to improve efficiency and maintainability.

Merge Readiness

The pull request is almost ready for merging. Addressing the comments regarding type safety, code clarity, and efficiency will significantly improve the quality of the code. I am unable to directly approve this pull request, and recommend that others review and approve this code before merging. I recommend that the pull request not be merged until the comments are addressed.

Comment on lines 65 to +69
if ((child.fields as VespaFileSearch).sddocname === fileSchema) {
;(child.fields as any).type = fileSchema
;(child.fields as any).relevance = child.relevance
;(child.fields as any).chunks_summary = (
child.fields as VespaFileSearch
).chunks_summary
return FileResponseSchema.parse(child.fields)
const fields = child.fields as VespaFileSearch & {
type?: string
chunks_summary?: string[]
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider defining a specific type for fields instead of using & { type?: string; chunks_summary?: string[] }. This would improve type safety and code readability.

type FileFields = VespaFileSearch & {
  type?: string
  chunks_summary?: string[]
}

const fields = child.fields as FileFields

Comment on lines +118 to +124
// creating a new property
fields.chunks_summary = fields.description
? chunkDocument(fields.description)
.map((v) => v.chunk)
.sort((a, b) => countHiTags(b) - countHiTags(a))
.slice(0, maxSearchChunks)
: []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The chunks_summary is sorted and sliced multiple times across different schema types. Consider creating a utility function to handle this logic to avoid repetition and improve maintainability.

function processChunks(chunks: string[] | undefined): string[] | undefined {
  if (!chunks) return undefined
  const sorted = [...chunks].sort((a, b) => countHiTags(b) - countHiTags(a))
  return sorted.slice(0, maxSearchChunks)
}

fields.chunks_summary = processChunks(fields.chunks_summary)

Comment on lines +119 to +124
fields.chunks_summary = fields.description
? chunkDocument(fields.description)
.map((v) => v.chunk)
.sort((a, b) => countHiTags(b) - countHiTags(a))
.slice(0, maxSearchChunks)
: []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating a new chunks_summary property here might be unexpected. It would be clearer to modify the existing property, or rename it to avoid confusion.

fields.processed_chunks_summary = fields.description
  ? chunkDocument(fields.description)
      .map((v) => v.chunk)
      .sort((a, b) => countHiTags(b) - countHiTags(a))
      .slice(0, maxSearchChunks)
  : []

Comment on lines 195 to 198
# bm25 gives a very high score as this has less text to match
# compared to other schemas
expression {
(DECAY * scale(combined_bm25))
scale(combined_nativeRank + (bm25(email)/2 + bm25(name))/2)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment mentions that bm25 gives a very high score as this has less text to match compared to other schemas. Consider adding more details to the comment to explain why nativeRank is being used and how it addresses this issue.

expression {
  // bm25 gives a very high score as this has less text to match
  // compared to other schemas. nativeRank is used to normalize the scores
  // and prevent disproportionately high scores for shorter content.
  scale(combined_nativeRank + (bm25(email)/2 + bm25(name))/2)
}

@zereraz zereraz changed the title fix user schema scoring and show keywords in event description improve search and simplify the ranking Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants