Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17632: Text to Vector Update Request Processor #3151

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

alessandrobenedetti
Copy link
Contributor

@alessandrobenedetti alessandrobenedetti commented Jan 31, 2025

https://issues.apache.org/jira/browse/SOLR-17632

Description

Scope of this issue is to introduce support for automatic text vectorisation in Apache Solr.
A LLM fine-tuned for sentence similarity will be accessed to vectorise the text.
Apache Solr hosts the configuration parameters to access vectorision.

Solution

An Update Request Processor that access the managed resource and call the vectorisation service for each document to process.

Tests

Unit and integration tests have been added

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@alessandrobenedetti
Copy link
Contributor Author

alessandrobenedetti commented Jan 31, 2025

Let's give it a first round of discussion/brainstorming/improvements.

Then I'll adjust the gradle check, documentation and changes.

In particular, I have some annoyance with tests before/after class stuff in intelliJ (stacktrace
UpdateRequestProcessorStacktrace.txt
attached)

I'll keep polishing it and working on it next week.

*/
@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
this.textToVector = modelStore.getModel(model);
Copy link
Contributor

@cpoerschke cpoerschke Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering re: textToVector as class member vs. local variable.

edit: if model is configured i.e. always the same it could be a class member, perhaps, but if perhaps model was a field in the cmd document then it would need to be a local variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debugging the flow to have a better understanding of the lifecycle of an update request processor.

From what I see from the test, the factory instantiates a new update request processor every time a new update request is received.
I think it's ok to keep it a class member, but let me see if I can move the instantiation to the factory.
Ideally I wanted that to happen when the factory is initiate but It seems that the update request processor factory is not compatible with resource loading (as far as I debugged and checked)

SolrInputField inputFieldContent = doc.get(inputField);
if (!isNullOrEmpty(inputFieldContent, doc, inputField)) {
String textToVectorise = inputFieldContent.getValue().toString();//add null checks and
float[] vector = textToVector.vectorise(textToVectorise);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So text-to-vector in the search case is ephemeral i.e. in case of errors or timeouts the user gets an error or exception back and they may or may not choose to retry.

For text-to-vector in the update case, might some users prefer to index only documents with (all the) vectors and others would rather have the document indexed even if some vectors are missing? (I assume but don't know for sure that the vectorise call might throw an exception in certain circumstances and if it's not caught that might fail the indexing for the entire document.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was chatting with @iamsanjay this morning, and I was expounding on the thought that a lot of folks might want to first index the document with just the core text/string/numbers, and then, since enrichment is SLOW, come back with a streaming expression and do things like vectorization, and an atomic update.. that way you pump your data in as fast as possible, and then enrich at your leisure... This model constrains your indexing speed to your vectorization speed. Not saying we should have the update request processor approach, but I also want to see a "out of band" vectorization process.

Copy link
Contributor Author

@alessandrobenedetti alessandrobenedetti Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. @cpoerschke : I double checked and the langchain4j library 'embed' method (that's used in our 'vectorise' method) returns a RuntimeException .
    That's bad as it was not detected without investigating the internals of the code (I hate these practices).
    I'll give it a thought, any suggestion is welcome!

  2. @epugh : given that 'update.chain' is a parameter, if you configure a chain with no vector enrichment and a chain with vector enrichment, what prevents you from first index using the 'no vectors' chain and then slowly updating the index with atomic updates that add vectors (using the vector-chain)? We should double check and add to the documentation once we consolidate the code, what do you think?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to follow-up on my feedback to the LLM module concerning use of the word "embedding". I first tried to say that I was not familiar with the word, and your response was to remove it (completely?) from the module. If "embedding" is an appropriate word then use it. The documentation should reference it in the ref guide, even if just an "AKA".


@Override
public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
req.getCore().getLatestSchema().getField(inputField);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it checks that 'inputField' is defined in the schema.
With the latest commit I changed it to make it more explicit but I am open to suggestions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need new test configuration files? Could you combine this into another in this module?

I challenge devs on this; test config files are a pain to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but the reason I added it is that I struggled to find testing methods such as org.apache.solr.util.RestTestBase#assertU(java.lang.String) that takes the chain as a parameter.
So I added as the default and I could test it.

I would not want to be the default when indexing docs for the query time test.
If you have any suggestion I'm open to changes

Copy link
Contributor

@epugh epugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be exciting! some comments, and a wish list of "can we support out of indexing path enrichment"

* limitations under the License.
*/

package org.apache.solr.llm.texttovector.update.processor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

texttovector looks odd.. and I know a. package like text.to.vector is silly. Isn't this an embedding tool? text to vector is making an embedding right? or... llm.vectorization.update.processor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, texttovector is horribly unreadable, maybe 'textvectorisation' ? adding it in the coming commit

SolrInputField inputFieldContent = doc.get(inputField);
if (!isNullOrEmpty(inputFieldContent, doc, inputField)) {
String textToVectorise = inputFieldContent.getValue().toString();//add null checks and
float[] vector = textToVector.vectorise(textToVectorise);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was chatting with @iamsanjay this morning, and I was expounding on the thought that a lot of folks might want to first index the document with just the core text/string/numbers, and then, since enrichment is SLOW, come back with a streaming expression and do things like vectorization, and an atomic update.. that way you pump your data in as fast as possible, and then enrich at your leisure... This model constrains your indexing speed to your vectorization speed. Not saying we should have the update request processor approach, but I also want to see a "out of band" vectorization process.

super.processAdd(cmd);
}

protected boolean isNullOrEmpty(SolrInputField inputFieldContent, SolrInputDocument doc, String fieldName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we warn on these types of "data model" issues in other aspects of solr? I mean, who would ever check solr logs to spot that these input fields are missing? I see this cluttering the logs and not being very useful to have operationally.. I think I would just query solr "show me all docs that don't have the field populate" to find the issue....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm I see your point, better if we just log a warning say "vectorisation failed", with the reason "null or empty source field" ?

I suspect that silent failure would be equally problematic to understand why there are no vectors? (I also just discovered that the 'vectorise' method could throw runtime exception)

alessandrobenedetti and others added 3 commits February 5, 2025 18:45
…ate/processor/TextToVectorUpdateProcessorFactory.java

Co-authored-by: Christine Poerschke <[email protected]>
…ate/processor/TextToVectorUpdateProcessor.java

Co-authored-by: Christine Poerschke <[email protected]>
@alessandrobenedetti
Copy link
Contributor Author

I wanted to follow-up on my feedback to the LLM module concerning use of the word "embedding". I first tried to say that I was not familiar with the word, and your response was to remove it (completely?) from the module. If "embedding" is an appropriate word then use it. The documentation should reference it in the ref guide, even if just an "AKA".

Embedding is widely used in the field, but it's a bit ambiguous and to be honest, I'm with you in not using any term that can cause confusion. Do you mean I added embedding in here somewhere? If that's the case, It's a mistake, point it to me and I'll remove it!

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 5, 2025
@alessandrobenedetti
Copy link
Contributor Author

Added one iteration of polishing, should have addressed @cpoerschke concerns on vectorisation failures (I took inspiration from the lang detect update processor).

I also removed the additional test solr config files addressing @dsmiley concerns.

I'm still puzzled by the testing errors I get (the before/after problems I mentioned in the first comment), any help there would be beneficial.

I think we made progress, I;ll wait some other iterations and then work on the documentation.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, we should ensure "embedding" is used at least once in the ref guide page so it's findable for users that know that term. I'm becoming familiar with it.

assertIsDenseVectorField(outputFieldSchema);

ManagedTextToVectorModelStore modelStore = ManagedTextToVectorModelStore.getManagedModelStore(req.getCore());
SolrTextToVectorModel textToVector = modelStore.getModel(modelName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume looking up the model is fast/cached if it already exists?

Comment on lines +33 to +36
@AfterClass
public static void cleanup() throws Exception {
afterTest();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird; why? Weird to see a test cleanup also be a suite cleanup.
(same for TestModelManager)

public class TextToVectorUpdateProcessorFactoryTest extends TestLlmBase {
private TextToVectorUpdateProcessorFactory factoryToTest =
new TextToVectorUpdateProcessorFactory();
private NamedList<String> args = new NamedList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's awkward to have this as a field, as you need to tend to its lifecycle by cleaning. It's a lightweight trivial object so why not have the test classes initialize them?

Comment on lines +94 to +95
Map<String, String[]> params = new HashMap<>();
MultiMapSolrParams mmparams = new MultiMapSolrParams(params);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more complex than needed; just create a ModifiableSolrParams (one line). Soon after I merge a PR up for review, you would do SolrParams.of().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation test-framework tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants