SOLR-17632: Text to Vector Update Request Processor #3151

alessandrobenedetti · 2025-01-31T17:20:42Z

https://issues.apache.org/jira/browse/SOLR-17632

Description

Scope of this issue is to introduce support for automatic text vectorisation in Apache Solr.
A LLM fine-tuned for sentence similarity will be accessed to vectorise the text.
Apache Solr hosts the configuration parameters to access vectorision.

Solution

An Update Request Processor that access the managed resource and call the vectorisation service for each document to process.

Tests

Unit and integration tests have been added

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

alessandrobenedetti · 2025-01-31T17:24:48Z

Let's give it a first round of discussion/brainstorming/improvements.

Then I'll adjust the gradle check, documentation and changes.

In particular, I have some annoyance with tests before/after class stuff in intelliJ (stacktrace
UpdateRequestProcessorStacktrace.txt
attached)

I'll keep polishing it and working on it next week.

...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java

.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java

cpoerschke · 2025-01-31T17:36:53Z

.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java

+     */
+    @Override
+    public void processAdd(AddUpdateCommand cmd) throws IOException {
+        this.textToVector = modelStore.getModel(model);


Wondering re: textToVector as class member vs. local variable.

edit: if model is configured i.e. always the same it could be a class member, perhaps, but if perhaps model was a field in the cmd document then it would need to be a local variable.

I was debugging the flow to have a better understanding of the lifecycle of an update request processor.

From what I see from the test, the factory instantiates a new update request processor every time a new update request is received.
I think it's ok to keep it a class member, but let me see if I can move the instantiation to the factory.
Ideally I wanted that to happen when the factory is initiate but It seems that the update request processor factory is not compatible with resource loading (as far as I debugged and checked)

cpoerschke · 2025-01-31T18:10:34Z

.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java

+        SolrInputField inputFieldContent = doc.get(inputField);
+        if (!isNullOrEmpty(inputFieldContent, doc, inputField)) {
+            String textToVectorise = inputFieldContent.getValue().toString();//add null checks and
+            float[] vector = textToVector.vectorise(textToVectorise);


So text-to-vector in the search case is ephemeral i.e. in case of errors or timeouts the user gets an error or exception back and they may or may not choose to retry.

For text-to-vector in the update case, might some users prefer to index only documents with (all the) vectors and others would rather have the document indexed even if some vectors are missing? (I assume but don't know for sure that the vectorise call might throw an exception in certain circumstances and if it's not caught that might fail the indexing for the entire document.)

I was chatting with @iamsanjay this morning, and I was expounding on the thought that a lot of folks might want to first index the document with just the core text/string/numbers, and then, since enrichment is SLOW, come back with a streaming expression and do things like vectorization, and an atomic update.. that way you pump your data in as fast as possible, and then enrich at your leisure... This model constrains your indexing speed to your vectorization speed. Not saying we should have the update request processor approach, but I also want to see a "out of band" vectorization process.

@cpoerschke : I double checked and the langchain4j library 'embed' method (that's used in our 'vectorise' method) returns a RuntimeException .
That's bad as it was not detected without investigating the internals of the code (I hate these practices).
I'll give it a thought, any suggestion is welcome!

@epugh : given that 'update.chain' is a parameter, if you configure a chain with no vector enrichment and a chain with vector enrichment, what prevents you from first index using the 'no vectors' chain and then slowly updating the index with atomic updates that add vectors (using the vector-chain)? We should double check and add to the documentation once we consolidate the code, what do you think?

dsmiley

I wanted to follow-up on my feedback to the LLM module concerning use of the word "embedding". I first tried to say that I was not familiar with the word, and your response was to remove it (completely?) from the module. If "embedding" is an appropriate word then use it. The documentation should reference it in the ref guide, even if just an "AKA".

...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java

dsmiley · 2025-02-05T02:56:33Z

...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java

+
+    @Override
+    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
+        req.getCore().getLatestSchema().getField(inputField);


What's the point of this line?

it checks that 'inputField' is defined in the schema.
With the latest commit I changed it to make it more explicit but I am open to suggestions

dsmiley · 2025-02-05T02:57:52Z

...les/llm/src/test-files/solr/collection1/conf/solrconfig-llm-indexing-notDenseVectorField.xml

Do you really need new test configuration files? Could you combine this into another in this module?

I challenge devs on this; test config files are a pain to maintain.

I could, but the reason I added it is that I struggled to find testing methods such as org.apache.solr.util.RestTestBase#assertU(java.lang.String) that takes the chain as a parameter.
So I added as the default and I could test it.

I would not want to be the default when indexing docs for the query time test.
If you have any suggestion I'm open to changes

epugh

This is going to be exciting! some comments, and a wish list of "can we support out of indexing path enrichment"

epugh · 2025-02-05T16:14:31Z

.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java

+ * limitations under the License.
+ */
+
+package org.apache.solr.llm.texttovector.update.processor;


texttovector looks odd.. and I know a. package like text.to.vector is silly. Isn't this an embedding tool? text to vector is making an embedding right? or... llm.vectorization.update.processor?

I agree, texttovector is horribly unreadable, maybe 'textvectorisation' ? adding it in the coming commit

epugh · 2025-02-05T16:18:21Z

.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java

+        SolrInputField inputFieldContent = doc.get(inputField);
+        if (!isNullOrEmpty(inputFieldContent, doc, inputField)) {
+            String textToVectorise = inputFieldContent.getValue().toString();//add null checks and
+            float[] vector = textToVector.vectorise(textToVectorise);


I was chatting with @iamsanjay this morning, and I was expounding on the thought that a lot of folks might want to first index the document with just the core text/string/numbers, and then, since enrichment is SLOW, come back with a streaming expression and do things like vectorization, and an atomic update.. that way you pump your data in as fast as possible, and then enrich at your leisure... This model constrains your indexing speed to your vectorization speed. Not saying we should have the update request processor approach, but I also want to see a "out of band" vectorization process.

epugh · 2025-02-05T16:19:46Z

.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java

+        super.processAdd(cmd);
+    }
+
+    protected boolean isNullOrEmpty(SolrInputField inputFieldContent, SolrInputDocument doc, String fieldName) {


do we warn on these types of "data model" issues in other aspects of solr? I mean, who would ever check solr logs to spot that these input fields are missing? I see this cluttering the logs and not being very useful to have operationally.. I think I would just query solr "show me all docs that don't have the field populate" to find the issue....

mmm I see your point, better if we just log a warning say "vectorisation failed", with the reason "null or empty source field" ?

I suspect that silent failure would be equally problematic to understand why there are no vectors? (I also just discovered that the 'vectorise' method could throw runtime exception)

...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java

...rg/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactoryTest.java

.../test/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorTest.java

solr/test-framework/src/java/org/apache/solr/util/RestTestBase.java

…ate/processor/TextToVectorUpdateProcessorFactory.java Co-authored-by: Christine Poerschke <[email protected]>

…ate/processor/TextToVectorUpdateProcessor.java Co-authored-by: Christine Poerschke <[email protected]>

alessandrobenedetti · 2025-02-05T19:40:33Z

I wanted to follow-up on my feedback to the LLM module concerning use of the word "embedding". I first tried to say that I was not familiar with the word, and your response was to remove it (completely?) from the module. If "embedding" is an appropriate word then use it. The documentation should reference it in the ref guide, even if just an "AKA".

Embedding is widely used in the field, but it's a bit ambiguous and to be honest, I'm with you in not using any term that can cause confusion. Do you mean I added embedding in here somewhere? If that's the case, It's a mistake, point it to me and I'll remove it!

alessandrobenedetti · 2025-02-06T13:03:52Z

Added one iteration of polishing, should have addressed @cpoerschke concerns on vectorisation failures (I took inspiration from the lang detect update processor).

I also removed the additional test solr config files addressing @dsmiley concerns.

I'm still puzzled by the testing errors I get (the before/after problems I mentioned in the first comment), any help there would be beneficial.

I think we made progress, I;ll wait some other iterations and then work on the documentation.

dsmiley

Basically, we should ensure "embedding" is used at least once in the ref guide page so it's findable for users that know that term. I'm becoming familiar with it.

dsmiley · 2025-02-06T14:13:44Z

...g/apache/solr/llm/textvectorisation/update/processor/TextToVectorUpdateProcessorFactory.java

+        assertIsDenseVectorField(outputFieldSchema);
+
+        ManagedTextToVectorModelStore modelStore = ManagedTextToVectorModelStore.getManagedModelStore(req.getCore());
+        SolrTextToVectorModel textToVector = modelStore.getModel(modelName);


I presume looking up the model is fast/cached if it already exists?

dsmiley · 2025-02-06T14:16:16Z

...dules/llm/src/test/org/apache/solr/llm/textvectorisation/search/TextToVectorQParserTest.java

+  @AfterClass
+  public static void cleanup() throws Exception {
+    afterTest();
+  }


weird; why? Weird to see a test cleanup also be a suite cleanup.
(same for TestModelManager)

dsmiley · 2025-02-06T14:19:14Z

...ache/solr/llm/textvectorisation/update/processor/TextToVectorUpdateProcessorFactoryTest.java

+public class TextToVectorUpdateProcessorFactoryTest extends TestLlmBase {
+  private TextToVectorUpdateProcessorFactory factoryToTest =
+      new TextToVectorUpdateProcessorFactory();
+  private NamedList<String> args = new NamedList<>();


I think it's awkward to have this as a field, as you need to tend to its lifecycle by cleaning. It's a lightweight trivial object so why not have the test classes initialize them?

dsmiley · 2025-02-06T14:20:54Z

...ache/solr/llm/textvectorisation/update/processor/TextToVectorUpdateProcessorFactoryTest.java

+    Map<String, String[]> params = new HashMap<>();
+    MultiMapSolrParams mmparams = new MultiMapSolrParams(params);


This is more complex than needed; just create a ModifiableSolrParams (one line). Soon after I merge a PR up for review, you would do SolrParams.of().

alessandrobenedetti added 4 commits January 31, 2025 16:05

SOLR-17632: first draft implementation

8f51adc

SOLR-17632: first draft implementation

04d3ba0

SOLR-17632: first draft implementation

bdffe76

SOLR-17632: empty and null checks

e983e04

github-actions bot added test-framework tool:build tests labels Jan 31, 2025

SOLR-17632: empty and null checks

ba8f28f

github-actions bot removed the tool:build label Jan 31, 2025

alessandrobenedetti requested review from dsmiley, epugh and cpoerschke January 31, 2025 17:23

cpoerschke reviewed Jan 31, 2025

View reviewed changes

...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java Outdated Show resolved Hide resolved

cpoerschke reviewed Jan 31, 2025

View reviewed changes

.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java Outdated Show resolved Hide resolved

cpoerschke reviewed Jan 31, 2025

View reviewed changes

dsmiley reviewed Feb 5, 2025

View reviewed changes

epugh reviewed Feb 5, 2025

View reviewed changes

alessandrobenedetti and others added 3 commits February 5, 2025 18:45

Update solr/modules/llm/src/java/org/apache/solr/llm/texttovector/upd…

7673b8b

…ate/processor/TextToVectorUpdateProcessorFactory.java Co-authored-by: Christine Poerschke <[email protected]>

Update solr/modules/llm/src/java/org/apache/solr/llm/texttovector/upd…

3066e01

…ate/processor/TextToVectorUpdateProcessor.java Co-authored-by: Christine Poerschke <[email protected]>

SOLR-17632: model loading mvoed to the factory

bf084f6

alessandrobenedetti added 2 commits February 5, 2025 20:24

SOLR-17632: changes from David review

c5fb1ed

SOLR-17632: package rename following Eric review

569de7b

github-actions bot added the documentation Improvements or additions to documentation label Feb 5, 2025

alessandrobenedetti added 2 commits February 5, 2025 20:41

SOLR-17632: cleanup comments

84710b4

SOLR-17632: cleanup but still problems with test before/after

7632c56

dsmiley reviewed Feb 6, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLR-17632: Text to Vector Update Request Processor #3151

SOLR-17632: Text to Vector Update Request Processor #3151

alessandrobenedetti commented Jan 31, 2025 •

edited

Loading

alessandrobenedetti commented Jan 31, 2025 •

edited

Loading

cpoerschke Jan 31, 2025 •

edited

Loading

alessandrobenedetti Feb 5, 2025

cpoerschke Jan 31, 2025

epugh Feb 5, 2025

alessandrobenedetti Feb 5, 2025 •

edited

Loading

dsmiley left a comment

dsmiley Feb 5, 2025

alessandrobenedetti Feb 5, 2025

dsmiley Feb 5, 2025

alessandrobenedetti Feb 5, 2025

epugh left a comment

epugh Feb 5, 2025

alessandrobenedetti Feb 5, 2025

epugh Feb 5, 2025

epugh Feb 5, 2025

alessandrobenedetti Feb 5, 2025

alessandrobenedetti commented Feb 5, 2025

alessandrobenedetti commented Feb 6, 2025

dsmiley left a comment

dsmiley Feb 6, 2025

dsmiley Feb 6, 2025

dsmiley Feb 6, 2025

dsmiley Feb 6, 2025

		Map<String, String[]> params = new HashMap<>();
		MultiMapSolrParams mmparams = new MultiMapSolrParams(params);

SOLR-17632: Text to Vector Update Request Processor #3151

Are you sure you want to change the base?

SOLR-17632: Text to Vector Update Request Processor #3151

Conversation

alessandrobenedetti commented Jan 31, 2025 • edited Loading

Description

Solution

Tests

Checklist

alessandrobenedetti commented Jan 31, 2025 • edited Loading

cpoerschke Jan 31, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alessandrobenedetti Feb 5, 2025 • edited Loading

Choose a reason for hiding this comment

dsmiley left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

epugh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alessandrobenedetti commented Feb 5, 2025

alessandrobenedetti commented Feb 6, 2025

dsmiley left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alessandrobenedetti commented Jan 31, 2025 •

edited

Loading

alessandrobenedetti commented Jan 31, 2025 •

edited

Loading

cpoerschke Jan 31, 2025 •

edited

Loading

alessandrobenedetti Feb 5, 2025 •

edited

Loading