Pre-Tokenizer Is Incongruent With Solr Default StandardTokenizer #4165

Open
criminosis opened this issue Dec 3, 2023 · 0 comments · May be fixed by #4166
Comments

@criminosis (Contributor)
I believe the pre-tokenizer used by the Solr client code is buggy: it produces tokens that are misaligned with Solr's own tokenization, in particular for text containing a domain name like foo.com. I submitted a report to the listserv and mentioned it in the JanusGraph Discord. @li-boxuan suggested opening a PR after a few days of no responses. Below is a repeat of what I posted on the listserv, archived here as a GitHub issue.


I've been working on transitioning my JanusGraph deployment's mixed index backend from ElasticSearch to Solr in order to have Solr's shard splitting functionality as my index grows.

However, one of my test cases started failing after making the switch. The test inserts a text document as the value of a property (let's call it "text_document_property"). That document contains a URL along the lines of "https://foo.com/abc123".

My test case then checks retrieval of this vertex with a query along the lines of "g.V().has('vertex_label', 'text_document_property', textContains('https://foo.com/abc123'))".

When my mixed index backend was Elasticsearch, this passed.

I think that's because the criterion is sent to ES "as-is" from JanusGraph, and Elasticsearch does the tokenizing on its side. That's based on the ES client code that gets invoked here and is ultimately converted into ES constructs here.

For Solr, however, it looks like JanusGraph does the tokenization itself, here.

When Text.tokenize is given "https://foo.com/abc123" it produces the tokens "https,foo,com,abc123"
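
For illustration, here's a minimal sketch of that kind of split-on-non-alphanumeric tokenization. This is only an approximation of the behavior described above, not JanusGraph's actual Text.tokenize implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // Approximation: split on anything that isn't a letter or digit,
    // so the dots and slashes in a URL all become token boundaries.
    static List<String> tokenize(String str) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : str.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("https://foo.com/abc123"));
        // prints: [https, foo, com, abc123]
    }
}
```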

But the default schema.xml provided by JanusGraph, if I'm reading it correctly, dictates usage of solr.StandardTokenizerFactory.

According to Solr's docs on the Standard Tokenizer:

Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

    Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

    The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

Consequently, I believe this is where the incongruence is introduced. When I submit my document, the tokenizer will therefore produce a token stream of "https, foo.com, abc123" due to the first exception listed. I've at least verified this in the Solr admin console when submitting a query.

The query text_document_property_t:(https foo.com abc123) works as desired, and I can see it returning the expected response in the Solr web console. However, the token stream produced by JanusGraph's Text.tokenize, "https, foo, com, abc123", produces a Solr query under the covers of text_document_property_t:(+https +foo +com +abc123) (the +'s coming from here, to convey to Solr that all tokens must be present), so it's readily apparent that the separate "+com" token will cause a failure to match.
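
To make the mismatch concrete, here's a rough sketch of how such a "+token" query string gets assembled from the client-side tokens. The field name and helper are illustrative only, not the actual SolrIndex code:

```java
import java.util.List;
import java.util.stream.Collectors;

public class QueryMismatchSketch {
    // Joins each client-side token with a leading '+' so Solr requires all of them.
    static String textContainsQuery(String field, List<String> tokens) {
        return field + ":(" + tokens.stream()
                .map(t -> "+" + t)
                .collect(Collectors.joining(" ")) + ")";
    }

    public static void main(String[] args) {
        // Tokens from the client-side pre-tokenizer:
        List<String> clientTokens = List.of("https", "foo", "com", "abc123");
        System.out.println(textContainsQuery("text_document_property_t", clientTokens));
        // prints: text_document_property_t:(+https +foo +com +abc123)
        // Solr indexed "foo.com" as a single token, so the separate "+com" can never match it.
    }
}
```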

If I'm understanding the schema.xml correctly, however, we're already configuring Solr with a designated tokenizer for queries, so I don't think we need to do this client-side tokenizing. For what it's worth, running the query text_document_property_t:(https://foo.com/abc123) in Solr's web console worked as expected (note the escaping I had to do, albeit in JanusGraph I believe that'll be handled here), so presumably Solr is tokenizing on its backend.

As further evidence of this configured Solr-side tokenizer, it appears JanusGraph only does this JanusGraph-side tokenizing for the textContains predicate and not the other text predicates. Accordingly, executing "g.V().has('vertex_label', 'text_document_property', textContainsPrefix('foo.com'))" appears to work, since it doesn't attempt to pre-tokenize the submission before sending it to Solr.

It appears the use of Text.tokenize was introduced back in 2014, instead of using the lowercase of the string as given prior to then (comment). It seems Solr 4 was released in 2012, with ongoing releases into 2014. Solr 4 broke with the original tokenizer (now called the Classic Tokenizer), so I'm wondering if this incongruence wasn't noticed at the time because JanusGraph (Titan back then) was still primarily targeting Solr 3.

Happy to port this into a GitHub issue if there's agreement it's not working as intended. On the surface it seems like it could be a simple code change, if all it takes is removing the special casing around textContains, folding it into the same handling logic as the other text predicates, and letting the Solr-side tokenizer do the work.
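
A minimal sketch of that direction, assuming we simply escape the raw value with SolrJ's ClientUtils.escapeQueryChars and let the Solr-side query analyzer tokenize it (the field name and helper method are illustrative only, not the proposed patch):

```java
import org.apache.solr.client.solrj.util.ClientUtils;

public class TextContainsSketch {
    // Instead of pre-tokenizing client-side, escape the raw value and let Solr's
    // configured query analyzer (StandardTokenizer) do the tokenizing.
    static String textContainsQuery(String field, String rawValue) {
        return field + ":(" + ClientUtils.escapeQueryChars(rawValue.trim().toLowerCase()) + ")";
    }

    public static void main(String[] args) {
        System.out.println(textContainsQuery("text_document_property_t", "https://foo.com/abc123"));
        // Query metacharacters such as ':' and '/' are escaped, so the whole value
        // reaches Solr and is tokenized with the same StandardTokenizer used at index time.
    }
}
```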

criminosis added a commit to criminosis/janusgraph that referenced this issue Dec 3, 2023
Closes JanusGraph#4164

Added new test document and updated assertions

Use tokenizer that mimics Solr's standardized tokenizer

Geo predicate tweaking

Implemented TextContainsPhrase for Solr
criminosis linked a pull request Dec 3, 2023 that will close this issue