-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17632: Text to Vector Update Request Processor #3151
base: main
Are you sure you want to change the base?
Conversation
Let's give it a first round of discussion/brainstorming/improvements. Then I'll adjust the gradle check, documentation and changes. In particular, I have some annoyance with tests before/after class stuff in intelliJ (stacktrace I'll keep polishing it and working on it next week. |
...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java
Outdated
Show resolved
Hide resolved
.../src/java/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessor.java
Outdated
Show resolved
Hide resolved
*/ | ||
@Override | ||
public void processAdd(AddUpdateCommand cmd) throws IOException { | ||
this.textToVector = modelStore.getModel(model); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering re: textToVector
as class member vs. local variable.
edit: if model
is configured i.e. always the same it could be a class member, perhaps, but if perhaps model
was a field in the cmd
document then it would need to be a local variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was debugging the flow to have a better understanding of the lifecycle of an update request processor.
From what I see from the test, the factory instantiates a new update request processor every time a new update request is received.
I think it's ok to keep it a class member, but let me see if I can move the instantiation to the factory.
Ideally I wanted that to happen when the factory is initiate but It seems that the update request processor factory is not compatible with resource loading (as far as I debugged and checked)
SolrInputField inputFieldContent = doc.get(inputField); | ||
if (!isNullOrEmpty(inputFieldContent, doc, inputField)) { | ||
String textToVectorise = inputFieldContent.getValue().toString();//add null checks and | ||
float[] vector = textToVector.vectorise(textToVectorise); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So text-to-vector in the search case is ephemeral i.e. in case of errors or timeouts the user gets an error or exception back and they may or may not choose to retry.
For text-to-vector in the update case, might some users prefer to index only documents with (all the) vectors and others would rather have the document indexed even if some vectors are missing? (I assume but don't know for sure that the vectorise
call might throw an exception in certain circumstances and if it's not caught that might fail the indexing for the entire document.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was chatting with @iamsanjay this morning, and I was expounding on the thought that a lot of folks might want to first index the document with just the core text/string/numbers, and then, since enrichment is SLOW, come back with a streaming expression and do things like vectorization, and an atomic update.. that way you pump your data in as fast as possible, and then enrich at your leisure... This model constrains your indexing speed to your vectorization speed. Not saying we should have the update request processor approach, but I also want to see a "out of band" vectorization process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
@cpoerschke : I double checked and the langchain4j library 'embed' method (that's used in our 'vectorise' method) returns a RuntimeException .
That's bad as it was not detected without investigating the internals of the code (I hate these practices).
I'll give it a thought, any suggestion is welcome! -
@epugh : given that 'update.chain' is a parameter, if you configure a chain with no vector enrichment and a chain with vector enrichment, what prevents you from first index using the 'no vectors' chain and then slowly updating the index with atomic updates that add vectors (using the vector-chain)? We should double check and add to the documentation once we consolidate the code, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to follow-up on my feedback to the LLM module concerning use of the word "embedding". I first tried to say that I was not familiar with the word, and your response was to remove it (completely?) from the module. If "embedding" is an appropriate word then use it. The documentation should reference it in the ref guide, even if just an "AKA".
...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java
Outdated
Show resolved
Hide resolved
...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java
Outdated
Show resolved
Hide resolved
|
||
@Override | ||
public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) { | ||
req.getCore().getLatestSchema().getField(inputField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it checks that 'inputField' is defined in the schema.
With the latest commit I changed it to make it more explicit but I am open to suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need new test configuration files? Could you combine this into another in this module?
I challenge devs on this; test config files are a pain to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could, but the reason I added it is that I struggled to find testing methods such as org.apache.solr.util.RestTestBase#assertU(java.lang.String) that takes the chain as a parameter.
So I added as the default and I could test it.
I would not want to be the default when indexing docs for the query time test.
If you have any suggestion I'm open to changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be exciting! some comments, and a wish list of "can we support out of indexing path enrichment"
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.solr.llm.texttovector.update.processor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
texttovector
looks odd.. and I know a. package like text.to.vector
is silly. Isn't this an embedding tool? text to vector is making an embedding right? or... llm.vectorization.update.processor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, texttovector is horribly unreadable, maybe 'textvectorisation' ? adding it in the coming commit
SolrInputField inputFieldContent = doc.get(inputField); | ||
if (!isNullOrEmpty(inputFieldContent, doc, inputField)) { | ||
String textToVectorise = inputFieldContent.getValue().toString();//add null checks and | ||
float[] vector = textToVector.vectorise(textToVectorise); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was chatting with @iamsanjay this morning, and I was expounding on the thought that a lot of folks might want to first index the document with just the core text/string/numbers, and then, since enrichment is SLOW, come back with a streaming expression and do things like vectorization, and an atomic update.. that way you pump your data in as fast as possible, and then enrich at your leisure... This model constrains your indexing speed to your vectorization speed. Not saying we should have the update request processor approach, but I also want to see a "out of band" vectorization process.
super.processAdd(cmd); | ||
} | ||
|
||
protected boolean isNullOrEmpty(SolrInputField inputFieldContent, SolrInputDocument doc, String fieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we warn on these types of "data model" issues in other aspects of solr? I mean, who would ever check solr logs to spot that these input fields are missing? I see this cluttering the logs and not being very useful to have operationally.. I think I would just query solr "show me all docs that don't have the field populate" to find the issue....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm I see your point, better if we just log a warning say "vectorisation failed", with the reason "null or empty source field" ?
I suspect that silent failure would be equally problematic to understand why there are no vectors? (I also just discovered that the 'vectorise' method could throw runtime exception)
...va/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactory.java
Outdated
Show resolved
Hide resolved
...rg/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorFactoryTest.java
Outdated
Show resolved
Hide resolved
.../test/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorTest.java
Outdated
Show resolved
Hide resolved
.../test/org/apache/solr/llm/texttovector/update/processor/TextToVectorUpdateProcessorTest.java
Outdated
Show resolved
Hide resolved
…ate/processor/TextToVectorUpdateProcessorFactory.java Co-authored-by: Christine Poerschke <[email protected]>
…ate/processor/TextToVectorUpdateProcessor.java Co-authored-by: Christine Poerschke <[email protected]>
Embedding is widely used in the field, but it's a bit ambiguous and to be honest, I'm with you in not using any term that can cause confusion. Do you mean I added embedding in here somewhere? If that's the case, It's a mistake, point it to me and I'll remove it! |
Added one iteration of polishing, should have addressed @cpoerschke concerns on vectorisation failures (I took inspiration from the lang detect update processor). I also removed the additional test solr config files addressing @dsmiley concerns. I'm still puzzled by the testing errors I get (the before/after problems I mentioned in the first comment), any help there would be beneficial. I think we made progress, I;ll wait some other iterations and then work on the documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, we should ensure "embedding" is used at least once in the ref guide page so it's findable for users that know that term. I'm becoming familiar with it.
assertIsDenseVectorField(outputFieldSchema); | ||
|
||
ManagedTextToVectorModelStore modelStore = ManagedTextToVectorModelStore.getManagedModelStore(req.getCore()); | ||
SolrTextToVectorModel textToVector = modelStore.getModel(modelName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume looking up the model is fast/cached if it already exists?
@AfterClass | ||
public static void cleanup() throws Exception { | ||
afterTest(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird; why? Weird to see a test cleanup also be a suite cleanup.
(same for TestModelManager)
public class TextToVectorUpdateProcessorFactoryTest extends TestLlmBase { | ||
private TextToVectorUpdateProcessorFactory factoryToTest = | ||
new TextToVectorUpdateProcessorFactory(); | ||
private NamedList<String> args = new NamedList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's awkward to have this as a field, as you need to tend to its lifecycle by cleaning. It's a lightweight trivial object so why not have the test classes initialize them?
Map<String, String[]> params = new HashMap<>(); | ||
MultiMapSolrParams mmparams = new MultiMapSolrParams(params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more complex than needed; just create a ModifiableSolrParams (one line). Soon after I merge a PR up for review, you would do SolrParams.of()
.
https://issues.apache.org/jira/browse/SOLR-17632
Description
Scope of this issue is to introduce support for automatic text vectorisation in Apache Solr.
A LLM fine-tuned for sentence similarity will be accessed to vectorise the text.
Apache Solr hosts the configuration parameters to access vectorision.
Solution
An Update Request Processor that access the managed resource and call the vectorisation service for each document to process.
Tests
Unit and integration tests have been added
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.